Correction, using tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava, <[email protected]> wrote:

> Yes, I have the jena-log4j.properties file within the jena repo and the
> TDB2.loader file under bin in the same repo.
>
> For me, when I run tdb2.loader --loc=../../db ../../file.nt, I see no
> logs. The process starts consuming cores and RAM but there's nothing on
> the console. When the loading is finished, the cursor moves on to the
> next line.
>
> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:
>
>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
>>> Thanks Andy, setting it that way worked.
>>>
>>> Also, can we turn on the verbose logging in TDB2.loader like we have
>>> in tdbloader2?
>>>
>>> Basically, giving an output of how many triples it's loading and how
>>> much time has elapsed so far.
>>
>> It does that by default for the data phase. The report step size is
>> longer (500k) than TDB1's.
>>
>> The index phase is more parallel and not all modes report progress.
>>
>> What are you seeing?
>> (Do you have a log4j.properties in the current directory?)
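An aside on the log4j.properties question: Jena's command-line tools of this vintage configured logging with Log4j 1.2, so a minimal log4j.properties in the current directory along these lines should make the loader's INFO progress lines visible on the console. This is a sketch; the appender name and pattern layout are illustrative, not the exact file Jena ships.

```properties
# Send INFO and above to the console.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %-20c{1} :: %m%n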
>> Andy
>>
>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>>
>> INFO Loader = LoaderParallel
>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
>> INFO Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples
>>      in 28.96s (Avg: 172,690)
>> INFO Finish - index POS
>> INFO Finish - index SPO
>> INFO Finish - index OSP
>> INFO Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>>
>> though the default may be faster on this small dataset
>>
>>> On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
>>>
>>>> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1.
>>>> Old name, before TDB2 came along, so we're a bit stuck with it.
>>>>
>>>> tdbloader2 respects the $TMPDIR environment variable.
>>>>
>>>> Or set the SORT_ARGS environment variable with --temporary-directory=
>>>> (or -T). See tdbloader2 --help
>>>>
>>>> Andy
>>>>
>>>> On 14/11/2019 02:54, Amandeep Srivastava wrote:
>>>>> I was trying to test the performance of tdb.tdbloader2 by creating a
>>>>> TDB database. The loader failed at the sort SPO step. The failure
>>>>> seems to occur because of insufficient storage in the /tmp folder.
>>>>> Can we point TDB to use another folder as /tmp?
>>>>>
>>>>> Error log:
>>>>> sort: write failed: /tmp/sortxRql3B: No space left on device
>>>>>
>>>>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava, <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Andy, for the detailed explanation :)
>>>>>>
>>>>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]> wrote:
>>>>>>
>>>>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
>>>>>>>> Thanks for the heads up, Dan. Will go and check the archives.
>>>>>>>>
>>>>>>>> I think I should find how to decide between TDB1 and TDB2 in the
>>>>>>>> archives itself.
>>>>>>>
>>>>>>> For large bulk loads, the TDB2 loader is faster if you use
>>>>>>> --loader=parallel (NB it can take over your machine's I/O!).
>>>>>>>
>>>>>>> See tdb2.tdbloader --help for the names of the built-in plans.
>>>>>>>
>>>>>>> The only way to know which is best is to try, but:
>>>>>>>
>>>>>>> The order of threading used is:
>>>>>>>
>>>>>>> sequential < light < phased < parallel
>>>>>>>
>>>>>>> (more threads does not always mean faster).
>>>>>>>
>>>>>>> sequential is roughly the same as the TDB1 bulk loader.
>>>>>>>
>>>>>>> parallel usually wins as data gets larger (several 100m) if the
>>>>>>> machine has the I/O to handle it.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Look through the list archives for posts from Andy describing the
>>>>>>>>> differences between TDB1 and TDB2. They have different
>>>>>>>>> optimizations; I don't recall the differences.
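The /tmp workaround Andy gives earlier in the thread can be sketched as follows. tdbloader2 (TDB1) shells out to the system sort(1) for its index phase, so either TMPDIR or sort's own flag via SORT_ARGS redirects the temporary files; the directory path here is hypothetical.

```shell
# Point sort's temporary files at a location with enough free space.
export TMPDIR="$HOME/tdb-tmp"      # hypothetical roomy location
mkdir -p "$TMPDIR"

# Equivalent, passing sort's own flag through SORT_ARGS:
export SORT_ARGS="--temporary-directory=$TMPDIR"

# Then rerun the load (paths as in the thread):
# tdbloader2 --loc ../../db ../../file.nt
```

The loader invocation is left commented since it needs a Jena installation and the dataset on disk.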
>>>>>>>>> thanks
>>>>>>>>> danno
>>>>>>>>>
>>>>>>>>> Dan Pritts
>>>>>>>>> ICPSR Computing and Network Services
>>>>>>>>>
>>>>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm trying to create a TDB database from Wikidata's official RDF
>>>>>>>>>> dump, to read the data using the Fuseki service. I need to make
>>>>>>>>>> a few queries for my personal project, and the online service
>>>>>>>>>> times out when running them.
>>>>>>>>>>
>>>>>>>>>> I have a 12-core machine with 36 GB memory.
>>>>>>>>>>
>>>>>>>>>> Can you please advise on the best way of creating the database?
>>>>>>>>>> Since the dump is huge, I cannot try all the approaches.
>>>>>>>>>> Besides, I'm not sure if the tdbloader function works in a
>>>>>>>>>> similar way on data of different sizes.
>>>>>>>>>>
>>>>>>>>>> Questions:
>>>>>>>>>>
>>>>>>>>>> 1. Which one would be better to use - tdb.tdbloader2 (TDB1) or
>>>>>>>>>> tdb2.tdbloader (TDB2) - for creating the database, and why? Any
>>>>>>>>>> specific configurations that I should be aware of?
>>>>>>>>>>
>>>>>>>>>> 2. I'm running a job currently using tdb.tdbloader2 but it is
>>>>>>>>>> using just a single core. Also, its loading speed is decreasing
>>>>>>>>>> slowly. It started at an avg of 120k tuples and is currently at
>>>>>>>>>> 80k tuples. Can you advise how I can utilize all the cores of
>>>>>>>>>> my machine and maintain the loading speed at the same time?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Aman
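Putting the thread's advice together, a load of a large dump with the TDB2 parallel plan can be sketched like this. The paths, dump name, and heap size are illustrative; JVM_ARGS is the variable Jena's launcher scripts read for JVM options.

```shell
# Moderate JVM heap; leave the rest of RAM to the OS file cache,
# which TDB2's memory-mapped files rely on.
export JVM_ARGS=-Xmx8G

# Parallel plan: usually fastest on large inputs if the disks keep up.
tdb2.tdbloader --loader=parallel --loc /data/wikidata-db dump.nt.gz

# If parallel saturates I/O, step down the plan ladder:
#   sequential < light < phased < parallel
# e.g.
#   tdb2.tdbloader --loader=phased --loc /data/wikidata-db dump.nt.gz
```

This is a sketch of one reasonable setup, not a tuning recommendation; as Andy says above, the only way to know the best plan for a given machine is to try.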
