On Tue, Aug 31, 2021 at 23:37, Linas Vepstas <linasveps...@gmail.com> wrote:
>
> Hi Amirouche,
>
> Here: https://linas.org/datasets/
>
> It has the bio dataset, I'll add the language dataset shortly.  The README 
> explains:
>
> * `mozi-data.scm.bz2` -- uncompressed, it's 371485784 bytes.
>   Contains the December 2019 version of the small public version of
>   the MOZI dataset.  This is just genetic and proteomic data from
>   popular public datasets, converted to Atomese s-expressions.
>   Stats are:
>   `((ConceptNode . 454779) (PredicateNode . 12) (ListLink . 1925554)
>   (MemberLink . 1850528) (AndLink . 98788) (EvaluationLink . 1887530)
>   (InheritanceLink . 122184) (GeneNode . 49050) (MoleculeNode . 368909))`
>   so that's about 7 million atoms (6757335, to be precise); loading
>   it into the AtomSpace results in an RSS of about 4.3 GBytes of RAM.
>   That's about 632 bytes/atom when in RAM.  It's just pure Atoms, no
>   Values.  Compare to 55 Bytes/atom when stored as uncompressed s-exprs,
>   or about 4 Bytes/atom when bzip2-compressed.  Clearly, indexes are
>   expensive!
>
> --linas
>
>
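The per-atom figures in the README excerpt above can be rechecked with a few lines of arithmetic. This is only a sketch: the variable names are mine, and I am assuming "4.3 GBytes" means 4.3e9 bytes (decimal gigabytes). Under that assumption the in-RAM figure comes out near 636 bytes/atom, close to the quoted 632 and within the rounding of "4.3 GBytes". Note that the nine listed per-type counts add up to 6757334, one less than the quoted total.

```python
# Figures copied from the README excerpt quoted above.
atom_counts = {
    "ConceptNode": 454779,
    "PredicateNode": 12,
    "ListLink": 1925554,
    "MemberLink": 1850528,
    "AndLink": 98788,
    "EvaluationLink": 1887530,
    "InheritanceLink": 122184,
    "GeneNode": 49050,
    "MoleculeNode": 368909,
}

TOTAL_ATOMS = 6757335    # total stated in the README
SEXPR_BYTES = 371485784  # size of the uncompressed s-expression file
RSS_BYTES = 4.3e9        # ASSUMPTION: "4.3 GBytes" read as decimal gigabytes


def bytes_per_atom(total_bytes, atoms):
    """Average storage cost of one atom, in bytes."""
    return total_bytes / atoms


listed_total = sum(atom_counts.values())
sexpr_per_atom = bytes_per_atom(SEXPR_BYTES, TOTAL_ATOMS)
ram_per_atom = bytes_per_atom(RSS_BYTES, TOTAL_ATOMS)

print(f"sum of listed counts: {listed_total}")
print(f"s-expression bytes/atom: {sexpr_per_atom:.1f}")
print(f"in-RAM bytes/atom: {ram_per_atom:.1f}")
print(f"RAM overhead vs. s-expressions: {ram_per_atom / sexpr_per_atom:.1f}x")
```

The last line gives the ratio behind the "indexes are expensive" remark: roughly an order of magnitude more memory per atom in RAM than on disk as text.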
> On Mon, Aug 30, 2021 at 5:48 AM Amirouche Boubekki 
> <amirouche.boube...@gmail.com> wrote:
>>
>> On Sun, Aug 29, 2021 at 23:59, Linas Vepstas <linasveps...@gmail.com> wrote:
>> >
>> > Hi Amirouche,
>> >
>> > On Thu, Aug 19, 2021 at 1:20 AM Amirouche Boubekki 
>> > <amirouche.boube...@gmail.com> wrote:
>> >>
>> >>
>> >> If you deliver a set of json or sexp files that is relevant to
>> >> opencog, I think about one terabyte or something like that, I can
>> >> demonstrate a JSON / s-exp database.
>> >
>> >
>> > I've been out of town. I can send you two. One will be a dump of (a 
>> > portion of) the agi-bio dataset. That dataset is itself just an import 
>> > into the atomspace of assorted external gene and protein databases.  It's 
>> > just "pure" s-expressions, no truth values or counts on them.  It's not a 
>> > terabyte; it's probably much smaller than a gigabyte (I'll find out shortly).
>> >
>> > The other will be a natural language dataset. Here, each s-exp will have a 
>> > numerical count on it.  It's the counts that matter.  I have small, 
>> > medium, large versions of this. I'll send the small one, no point in 
>> > struggling with something huge.
>>
>> That is wiser. Let me know where I can fetch the data, and whether the
>> server must be behind a login and password. My server is located in
>> Helsinki in Finland, and it is not encrypted so better keep secrets
>> away from it. I think it will be easier for me to make sense of the
>> natural language data, but anything sexp should do.
>>
>> >
>> > The format will be "Atomese": Atoms in s-expressions are globally unique 
>> > and immutable and indexed (thus, searchable). Values in s-expressions are 
>> > fleeting, ephemeral, subject to change, and not indexed (thus, not 
>> > searchable).
>> >
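The Atom/Value distinction described above can be illustrated with a toy model. To be clear, this is not the OpenCog AtomSpace API: `Atom`, `ToyAtomSpace` and all of its methods are hypothetical names made up for this sketch. It only shows the idea that atoms are immutable, deduplicated and indexed, while values are mutable annotations that no index covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Immutable atom: a node (type + name) or a link (type + outgoing)."""
    type: str
    name: str = ""
    outgoing: tuple = ()


class ToyAtomSpace:
    """Hypothetical stand-in for an atom store; not the real AtomSpace."""

    def __init__(self):
        self._atoms = {}    # atom -> canonical instance (global uniqueness)
        self._by_type = {}  # type name -> set of atoms (the searchable index)
        self._values = {}   # atom -> {key: value}; mutable and NOT indexed

    def add(self, type_, name="", outgoing=()):
        atom = Atom(type_, name, tuple(outgoing))
        # Adding an equal atom twice returns the instance already stored.
        canonical = self._atoms.setdefault(atom, atom)
        self._by_type.setdefault(type_, set()).add(canonical)
        return canonical

    def by_type(self, type_):
        """Indexed lookup: all atoms of a given type."""
        return self._by_type.get(type_, set())

    def set_value(self, atom, key, value):
        """Attach or overwrite a value; the atom itself never changes."""
        self._values.setdefault(atom, {})[key] = value

    def get_value(self, atom, key):
        return self._values.get(atom, {}).get(key)


space = ToyAtomSpace()
word_a = space.add("ConceptNode", "hello")
word_b = space.add("ConceptNode", "hello")  # same atom, not a copy
pair = space.add("ListLink", outgoing=(word_a, space.add("ConceptNode", "world")))

space.set_value(pair, "count", 3)
space.set_value(pair, "count", 4)  # values can change freely

print(word_a is word_b)
print(len(space.by_type("ConceptNode")))
print(space.get_value(pair, "count"))
```

In this sketch you can find every `ConceptNode` through the type index, but there is no way to ask "which atoms have count 4" short of scanning every atom, which mirrors the "not indexed, thus not searchable" point.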

Thanks for the quick reply. I spent three weeks coding in the wrong
direction (!) and need to rest a bit. I will let you know when I have
something usable.
