Hi Amirouche,

Here: https://linas.org/datasets/

It has the bio dataset; I'll add the language dataset shortly.  The README
explains:

* `mozi-data.scm.bz2` -- uncompressed, it's 371485784 bytes.
  Contains the December 2019 version of the small public version of
  the MOZI dataset.  This is just genetic and proteomic data from
  popular public datasets, converted to Atomese s-expressions.
  Stats are:
  `((ConceptNode . 454779) (PredicateNode . 12) (ListLink . 1925554)
  (MemberLink . 1850528) (AndLink . 98788) (EvaluationLink . 1887530)
  (InheritanceLink . 122184) (GeneNode . 49050) (MoleculeNode . 368909))`
  so that's about 7 million atoms (6757335, to be precise); loading
  it into the AtomSpace results in an RSS of about 4.3 GBytes of RAM.
  That's about 632 bytes/atom when in RAM.  It's just pure Atoms, no
  Values.  Compare to 55 bytes/atom when stored as uncompressed s-exprs,
  or about 4 bytes/atom when bzip2-compressed.  Clearly, indexes are
  expensive!
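
The per-atom figures above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the counts and the file size are copied from the README, while reading "4.3 GBytes" as 4.3 * 10**9 bytes is an assumption, and the helper name `bytes_per_atom` is made up for this example.

```python
# Per-type atom counts, copied from the README stats quoted above.
ATOM_COUNTS = {
    "ConceptNode": 454779,
    "PredicateNode": 12,
    "ListLink": 1925554,
    "MemberLink": 1850528,
    "AndLink": 98788,
    "EvaluationLink": 1887530,
    "InheritanceLink": 122184,
    "GeneNode": 49050,
    "MoleculeNode": 368909,
}

STATED_TOTAL_ATOMS = 6757335    # total quoted in the README
UNCOMPRESSED_BYTES = 371485784  # size of the uncompressed .scm file
RSS_BYTES = 4.3e9               # ASSUMPTION: "4.3 GBytes" read as 4.3 * 10**9


def bytes_per_atom(total_bytes: float, atom_count: int) -> float:
    """Average storage cost of one atom (hypothetical helper)."""
    return total_bytes / atom_count


# Sum of the per-type counts; it need not match the stated total exactly,
# for instance if some atom type is missing from the listing.
listed_total = sum(ATOM_COUNTS.values())

on_disk = bytes_per_atom(UNCOMPRESSED_BYTES, STATED_TOTAL_ATOMS)
in_ram = bytes_per_atom(RSS_BYTES, STATED_TOTAL_ATOMS)

print(f"listed atoms:          {listed_total}")
print(f"s-expr bytes per atom: {on_disk:.1f}")
print(f"RSS bytes per atom:    {in_ram:.1f}")
print(f"RAM / s-expr ratio:    {in_ram / on_disk:.1f}")
```

The in-RAM figure depends on how the rounded "4.3 GBytes" is read: taking it as GiB instead of decimal GB shifts the result by several percent, so treat it as an order-of-magnitude check on the quoted 632 bytes/atom rather than an exact reproduction.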

--linas


On Mon, Aug 30, 2021 at 5:48 AM Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> On Sun, Aug 29, 2021 at 23:59, Linas Vepstas <linasveps...@gmail.com>
> wrote:
> >
> > Hi Amirouche,
> >
> > On Thu, Aug 19, 2021 at 1:20 AM Amirouche Boubekki <
> > amirouche.boube...@gmail.com> wrote:
> >>
> >>
> >> If you deliver a set of JSON or s-exp files that is relevant to
> >> opencog, about one terabyte or something like that, I can
> >> demonstrate a JSON / s-exp database.
> >
> >
> > I've been out of town. I can send you two. One will be a dump of (a
> > portion of) the agi-bio dataset. That dataset is itself just an import into
> > the AtomSpace of assorted external gene and protein databases.  It's just
> > "pure" s-expressions, no truth values or counts on them.  It's not a
> > terabyte; it's probably much smaller than a gigabyte (I'll find out shortly).
> >
> > The other will be a natural language dataset. Here, each s-exp will have
> > a numerical count on it.  It's the counts that matter.  I have small,
> > medium, and large versions of this. I'll send the small one; there's no
> > point in struggling with something huge.
>
> That is wiser. Let me know where I can fetch the data, and whether the
> server must be behind a login and password. My server is located in
> Helsinki in Finland, and it is not encrypted so better keep secrets
> away from it. I think it will be easier for me to make sense of the
> natural language data, but anything sexp should do.
>
> >
> > The format will be "Atomese": Atoms in s-expressions are globally unique,
> > immutable, and indexed (thus, searchable). Values in s-expressions are
> > fleeting, ephemeral, subject to change, and not indexed (thus, not
> > searchable).
> >
> > --linas
>
> --
> Amirouche ~ https://hyper.dev
>
> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to opencog+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/CAL7_Mo9B-ZQMgCbyTAcQL23PwX50w-qwYqSekdRdaHP0ryGchQ%40mail.gmail.com
> .
>


-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
