Hi Glenn,
How big is the data on disk in total? About 22GB of RDF/XML?
If I were doing this I would convert to ntriples which you can do with
something like:
find /source -type f -exec rapper -i rdfxml -o ntriples {} \; >> /destination/big.rdf.nt
Then I'd load directly from that, or split that into a smaller number of
chunks.
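Since N-Triples is one triple per line, you can split the merged file on line
boundaries and every chunk stays valid N-Triples. Something like this (the
line count is just a guess at a comfortable chunk size):

split -l 50000000 /destination/big.rdf.nt /destination/chunk-

then point tdbloader at the resulting chunk-* files in one go.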
rob
On Thu, Mar 1, 2012 at 11:21 AM, Glenn Proctor <[email protected]> wrote:
> Hi Andrea
>
> It's the PDB dataset, the files are about 300KB on average, although
> some are as big as 5MB. I have played around with rapper and was
> considering using ntriples files since they are more amenable to
> simple command-line manipulation, so it's something I'll definitely
> bear in mind.
>
> Thanks
>
> Glenn.
>
>
> On Thu, Mar 1, 2012 at 10:29 AM, Andrea Splendiani
> <[email protected]> wrote:
> > Hi,
> >
> > just a question: how big are your files, on average?
> > I have been dealing with large RDF datasets (like UniProt) that come in
> > "many" small files.
> > One way to get around the file number/size issue is to convert everything
> > to N-Triples (e.g. with rapper, which is very fast). Then you can cut and
> > merge files on the command line quite efficiently.
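> >
> > As a rough sketch (file names are just examples): since each line of an
> > N-Triples file is a complete triple, concatenating and re-splitting is
> > plain text processing, e.g.
> >
> > cat part-*.nt > merged.nt
> > split -l 10000000 merged.nt chunk-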
> >
> > best,
> > Andrea
> >
> > On 1 Mar 2012, at 09:53, Glenn Proctor wrote:
> >
> >> Hi
> >>
> >> I have a dataset which I would like to load into a TDB instance, and
> >> access with Fuseki. Certain aspects of the dataset and TDB make this
> >> challenging.
> >>
> >> The dataset (provided by someone else) is split into about 80000
> >> individual RDF/XML files, all in one directory by default. This makes
> >> even straightforward directory listings slow, and means that the
> >> command-line length/maximum number of arguments is exceeded, so I
> >> can't just refer to *.rdf.
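> >> (To illustrate: even something as simple as ls *.rdf fails with "Argument
> >> list too long"; piping a file list instead, e.g.
> >> find . -name '*.rdf' | xargs wc -l, works but is clumsy.)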
> >>
> >> My first approach has been to create separate directories, each with
> >> about a third of the files, and use tdbloader to load each group in
> >> turn. I gave tdbloader 6GB of memory (of the 7GB available on the
> >> machine) and it took four hours to load and index the first group of
> >> files, about 207 million triples in total. As Andy mentioned in a thread
> >> yesterday, the triples/sec count gradually declined over the course of
> >> the import (from about 30k/sec to 24k/sec).
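> >> (For reference, the invocation looks roughly like the following, with
> >> illustrative paths; the heap is set via the loader script's JVM_ARGS
> >> variable:
> >>
> >> JVM_ARGS="-Xmx6G" tdbloader --loc=/data/tdb /data/pdb-part1/*.rdf
> >>
> >> where /data/pdb-part1 is the first of the three directories.)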
> >>
> >> However, when I tried to use tdbloader to load the next group of files
> >> into the same TDB, I found that performance declined dramatically -
> >> down to about 400 triples/sec right from the start. Is this expected
> >> behaviour? I wonder if it's because it's trying to add new data to an
> >> already indexed set - is this the case, and if so is there any way to
> >> improve the performance? Coming from a relational database background,
> >> my instinct would be to postpone indexing until all the triples were
> >> loaded (i.e. after the third group of files was imported); however, I
> >> couldn't see any options affecting index creation in tdbloader.
> >>
> >> Another question is whether the strategy I've adopted (i.e. loading 3
> >> groups of ~27k files consecutively) is the correct one. The
> >> alternative would be to merge all 80k files into one in a separate
> >> step, then load the resulting humongous file. I suspect that there
> >> would be different issues with that approach.
> >>
> >> Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
> >> instance be better? Or three separate TDB instances? Obviously the
> >> latter would require some sort of query federation layer.
> >>
> >> I'm relatively new to this whole area so any tips on best practice
> >> would be appreciated.
> >>
> >> Regards
> >>
> >> Glenn.
> >
> >
>