On Tue, Aug 31, 2021 at 23:37, Linas Vepstas <linasveps...@gmail.com> wrote:
>
> Hi Amirouche,
>
> Here: https://linas.org/datasets/
>
> It has the bio dataset, I'll add the language dataset shortly. The README
> explains:
>
> * `mozi-data.scm.bz2` -- uncompressed, it's 371485784 bytes.
> Contains the December 2019 version of the small public version of
> the MOZI dataset. This is just genetic and proteomic data from
> popular public datasets, converted to Atomese s-expressions.
> Stats are:
> `((ConceptNode . 454779) (PredicateNode . 12) (ListLink . 1925554)
> (MemberLink . 1850528) (AndLink . 98788) (EvaluationLink . 1887530)
> (InheritanceLink . 122184) (GeneNode . 49050) (MoleculeNode . 368909))`
> so that's about 7 million atoms (It's 6757335 to be precise); loading
> it into the AtomSpace results in an RSS of about 4.3 GBytes RAM.
> That's about 632 bytes/atom when in RAM. It's just pure Atoms, no
> Values. Compare to 55 Bytes/atom when stored as uncompressed s-exprs,
> or about 4 Bytes/atom when bzip2-compressed. Clearly, indexes are
> expensive!
>
> --linas
>
>
> On Mon, Aug 30, 2021 at 5:48 AM Amirouche Boubekki
> <amirouche.boube...@gmail.com> wrote:
>>
>> On Sun, Aug 29, 2021 at 23:59, Linas Vepstas <linasveps...@gmail.com> wrote:
>> >
>> > Hi Amirouche,
>> >
>> > On Thu, Aug 19, 2021 at 1:20 AM Amirouche Boubekki
>> > <amirouche.boube...@gmail.com> wrote:
>> >>
>> >>
>> >> If you deliver a set of json or sexp files that is relevant to
>> >> opencog, I think about one terabyte or something like that, I can
>> >> demonstrate a JSON / s-exp database.
>> >
>> >
>> > I've been out of town. I can send you two. One will be a dump of (a
>> > portion of) the agi-bio dataset. That dataset is itself just an import
>> > into the atomspace of assorted external gene and protein databases. It's
>> > just "pure" s-expressions, no truth values or counts on them. It's not a
>> > terabyte, it's probably much smaller than a gigabyte (I'll find out shortly)
>> >
>> > The other will be a natural language dataset. Here, each s-exp will have a
>> > numerical count on it. It's the counts that matter. I have small,
>> > medium, large versions of this. I'll send the small one, no point in
>> > struggling with something huge.
>>
>> That is wiser. Let me know where I can fetch the data, and whether the
>> server must be behind a login and password. My server is located in
>> Helsinki in Finland, and it is not encrypted so better keep secrets
>> away from it. I think it will be easier for me to make sense of the
>> natural language data, but anything sexp should do.
>>
>> >
>> > The format will be "Atomese": Atoms in s-expressions are globally unique
>> > and immutable and indexed (thus, searchable). Values in s-expressions are
>> > fleeting, ephemeral, subject to change, and not indexed (thus, not
>> > searchable)
>> >

Thanks for the quick reply. I spent 3 weeks coding in the wrong direction (!), I need to rest a bit. I will let you know when I have something usable.
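
For reference, the per-atom figures in the README quoted above can be re-derived with a few lines of Python. This is only a rough sketch: the stats string, the uncompressed size and the atom total are copied from the quote, while the helper name `parse_counts` is made up for illustration, and reading "4.3 GBytes" as 4.3e9 bytes is an assumption (the README does not say whether the figure is decimal or binary).

```python
import re

# Stats alist copied verbatim from the README quoted above.
STATS = """((ConceptNode . 454779) (PredicateNode . 12) (ListLink . 1925554)
(MemberLink . 1850528) (AndLink . 98788) (EvaluationLink . 1887530)
(InheritanceLink . 122184) (GeneNode . 49050) (MoleculeNode . 368909))"""

UNCOMPRESSED_BYTES = 371485784  # size of the uncompressed file, per the README
REPORTED_ATOMS = 6757335        # atom total quoted in the README
RSS_BYTES = 4.3e9               # ASSUMPTION: "4.3 GBytes" taken as decimal GB


def parse_counts(alist):
    """Parse a Scheme association list of (TypeName . count) pairs.

    Hypothetical helper: a regex is enough for this flat alist, it is
    not a general s-expression parser.
    """
    pairs = re.findall(r"\((\w+) \. (\d+)\)", alist)
    return {name: int(count) for name, count in pairs}


counts = parse_counts(STATS)
listed_total = sum(counts.values())

# Bytes per atom on disk (uncompressed s-expressions) and in RAM.
disk_per_atom = UNCOMPRESSED_BYTES / REPORTED_ATOMS
ram_per_atom = RSS_BYTES / REPORTED_ATOMS

print(f"atom types listed : {len(counts)}")
print(f"sum of listed     : {listed_total}")
print(f"bytes/atom on disk: {disk_per_atom:.1f}")
print(f"bytes/atom in RAM : {ram_per_atom:.1f}")
print(f"RAM / disk ratio  : {ram_per_atom / disk_per_atom:.1f}")
```

Under these assumptions the on-disk figure rounds to the quoted 55 bytes/atom, and the in-RAM figure lands within a few bytes of the quoted 632; the small gap is presumably rounding in the "4.3 GBytes" number. The nine listed type counts sum to 6757334, one fewer than the quoted total, which the sketch does not try to explain.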