Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

2017-12-24 Thread Laura Morales
From what I can tell, and from my limited experience, you should not see such
long waiting/idling times. But I've never used Windows (and I'm confident
you'd get a better environment if you just switched to GNU/Linux).
In any case, you could try merging all your files into a single .nt file (using RIOT)
and loading only that file.
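
Something along these lines should work (an untested sketch; on Windows these are riot.bat and tdbloader.bat from the Jena bat directory, and the paths here are placeholders):

riot --output=ntriples file1.ttl file2.ttl ... > all.nt
tdbloader --loc=C:\path\to\DB all.nt

Listing 10,000 files on one command line may not be practical, so a small script that converts and appends them to all.nt in batches works just as well, since N-Triples files can simply be concatenated.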
 
 

Sent: Monday, December 25, 2017 at 5:51 AM
From: "Shengyu Li" 
To: users@jena.apache.org
Subject: Is There Any Way to Shorten The Waiting Time After Upload Triples in 
Jena?

Hello,
 
I am uploading my .ttl data to my database. There are about 10,000 files in
total, each about 4 MB, so the new data is about 40 GB altogether. My original
database is also about 40 GB. The server runs on my local computer.
 
I use tdbloader.bat --loc to load the data. After it prints "Finish quads load", it
pauses at that point for a long time (about half an hour for one 4 MB file; if I
load 200 files at once (200 x 4 MB), the pause is about 2 hours). After the
pause, control returns to the command prompt.
 
I guess the pause means the database is organizing the data I just uploaded,
which is why it does not return for a long time. Am I right? Is there any
way to shorten the waiting time?
 
Thank you very much! Jena is really useful!
 
Best,
Shengyu


Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

2017-12-24 Thread Shengyu Li
Hello,

I am uploading my .ttl data to my database. There are about 10,000 files in
total, each about 4 MB, so the new data is about 40 GB altogether. My original
database is also about 40 GB. The server runs on my local computer.

I use tdbloader.bat --loc to load the data. After it prints "Finish quads load", it
pauses at that point for a long time (about half an hour for one 4 MB file; if I
load 200 files at once (200 x 4 MB), the pause is about 2 hours). After the
pause, control returns to the command prompt.

I guess the pause means the database is organizing the data I just uploaded,
which is why it does not return for a long time. Am I right? Is there any
way to shorten the waiting time?

Thank you very much! Jena is really useful!

Best,
Shengyu


Re: Python bindings?

2017-12-24 Thread dandh988
We use Python against Jena/Fuseki/CustomHTTP and find direct SPARQL against the
endpoint to be "fast". The Python devs dropped RDFLib.
We also have a Thrift connection in development, which is proving useful for
low-level Jena API access.
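
For what it's worth, a minimal sketch of the "direct SPARQL against the endpoint" approach from Python (the dataset URL is a hypothetical placeholder; it only assumes the standard SPARQL protocol that Fuseki speaks):

# Query a Fuseki endpoint directly over HTTP via the SPARQL protocol.
import requests

FUSEKI_QUERY = "http://localhost:3030/dataset/sparql"  # hypothetical dataset name

def select(query):
    resp = requests.post(FUSEKI_QUERY, data={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

for row in select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"):
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])

Reusing a requests.Session and batching queries also helps amortise the per-request HTTP overhead Stefano mentions.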

Dick
-------- Original message --------
From: Stefano Cossu
Date: 24/12/2017 22:10 (GMT+00:00)
To: users@jena.apache.org
Subject: Python bindings?
Hello,
I am writing an LDP server using Python's RDFLib, with Fuseki/TDB as a
back-end store.

Right now my application is very slow, I suspect due to HTTP overhead:
profiling shows a large chunk of time spent waiting on sockets.

Is there a reliable way to write Python code against the Fuseki Java
API? I understand that Fuseki is written in Java and that there are no native
Python bindings. I have looked at options such as Jython, JPype and
PyJnius, but I am wondering how reliable these options are. Any suggestions?

Thanks,
Stefano


Python bindings?

2017-12-24 Thread Stefano Cossu

Hello,
I am writing an LDP server using Python's RDFLib, with Fuseki/TDB as a
back-end store.


Right now my application is very slow, I suspect due to HTTP overhead:
profiling shows a large chunk of time spent waiting on sockets.


Is there a reliable way to write Python code against the Fuseki Java
API? I understand that Fuseki is written in Java and that there are no native
Python bindings. I have looked at options such as Jython, JPype and
PyJnius, but I am wondering how reliable these options are. Any suggestions?


Thanks,
Stefano


Re: Deleting triples from default graph

2017-12-24 Thread Stefano Cossu
Thanks, Adam. I was actually wondering why rdflib behaves differently with an
in-memory store (it actually allows you to delete triples from any graph by
not specifying one). I am still testing both scenarios, but what you say
makes sense.


Best,
Stefano

On 12/20/2017 08:21 AM, ajs6f wrote:

Just to be clear, Stefano, the default graph is a read-only view specifically 
_because_ you set `tdb:unionDefaultGraph` to true.

Suppose three named graphs all contain the same triple and therefore that 
triple appears in a union default graph. If you try to delete the triple from 
the union, it's not clear from which of the named graphs it should be deleted. 
If you don't set the default graph to be a union, it behaves much like any 
named graph (wrt mutation).
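
For completeness, one workaround that does work against a union default graph is to bind the graph with a variable, so the triple is removed from whichever named graph(s) actually contain it. A sketch (the URIs below are hypothetical placeholders):

PREFIX ex: <http://example.org/>

# Delete the triple from every named graph that contains it.
DELETE { GRAPH ?g { ex:s ex:p ex:o } }
WHERE  { GRAPH ?g { ex:s ex:p ex:o } }

Stefano mentioned that specifying a variable for the graph already deletes the triple; the remaining difficulty is only in issuing such an update through RDFLib.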

ajs6f


On Dec 19, 2017, at 4:01 PM, Stefano Cossu  wrote:

Thanks for the clarification, Andy. I was not aware that the default graph is
just a read-only view.

Stefano

On 12/19/2017 02:56 PM, Andy Seaborne wrote:

Stefano,

Is there any way I can specify a union graph IRI for update?

The triples really are in the named graph; the default graph for query is a
view of all the named graphs. To delete, delete from the named graph and the
triples will disappear from the default graph (if they are not in another graph
as well).
DELETE DATA {
  GRAPH <graph> {
    ... triples ...
  }
}
In an update, changes to the default graph go to the real (storage) default graph.

    Andy
On 19/12/17 17:53, Stefano Cossu wrote:

Hello,

I have inserted this data set into a TDB-backed database, with the
`tdb:unionDefaultGraph` option set to true:

INSERT DATA {
   GRAPH  {
.
.
   }
}


If I query the default graph I can see the triples:

SELECT * {
   ?s ?p ?o .
}

However, if I try to delete triples from the default graph without naming the 
named graph, the triples won't go away:

DELETE {
  .
} WHERE {
  .
}

If I specify the named graph, or use a variable for it, the triple is deleted. 
However, I am using Python's RDFLib to manage the interaction with Fuseki and I 
don't have an easy way to perform an update query on a graph indicated by a 
variable.

The default Jena graph name `` won't work either.

My configuration is pretty standard:

@prefix :       <#> .

@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .


:service_tdb_all  a                   fuseki:Service ;
    rdfs:label                        "TDB dev" ;
    fuseki:dataset                    :tdb_dataset_readwrite ;
    fuseki:name                       "dev" ;
    fuseki:serviceQuery               "query" , "sparql" ;
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceUpload              "upload" .


:tdb_dataset_readwrite
    a                     tdb:DatasetTDB ;
    tdb:unionDefaultGraph true ;
    tdb:location          "/opt/fuseki/current/run/databases/dev" .


Is there any way I can specify a union graph IRI for update?

Thanks,
Stefano





--
Stefano Cossu
Director of Application Services, Collections

The Art Institute of Chicago
116 S. Michigan Ave.
Chicago, IL 60603
312-499-4026



Re: performance measures

2017-12-24 Thread Andrew U. Frank

Thank you for the good advice!

The argument is to show that triple stores are fast enough for linguistic
applications. Five years ago a comparison was published in which a
proprietary data structure excelled. I would like to show that
triple stores are fast enough today. I can perhaps get the same dataset
and the same queries (at the application level), but I have no idea how
caching was accounted for; it seems that results differed between runs.


I guess I could use some warm-up queries, similar in kind to the
application queries, then run the test queries and
compare with the previously published response times. If the response
times are of the same order of magnitude as before, that would show
that the triple store is fast enough.


Does this sound "good enough"?
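
If it helps, here is a minimal sketch of such a warm-up-then-measure run (the endpoint URL and the queries are hypothetical placeholders; it only assumes a SPARQL endpoint such as Fuseki):

# Run warm-up queries first, then time the test queries over several repetitions.
import time
import requests

ENDPOINT = "http://localhost:3030/dataset/sparql"        # hypothetical endpoint
WARMUP   = ["SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"]
QUERIES  = ["SELECT ?s WHERE { ?s a ?type } LIMIT 1000"]  # replace with the real test queries

def run(query):
    r = requests.post(ENDPOINT, data={"query": query},
                      headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return r.json()

for q in WARMUP:
    run(q)                                    # populate caches, discard timing

for q in QUERIES:
    times = []
    for _ in range(5):                        # average over several runs
        start = time.perf_counter()
        run(q)
        times.append(time.perf_counter() - start)
    print(q, "avg %.3f s" % (sum(times) / len(times)))

Reporting the spread across runs as well as the average would also make the cache effects Adam and Andy mention visible rather than hidden.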


On 12/24/2017 01:24 PM, ajs6f wrote:

Any measurements would be unreliable at best and probably worthless.
1/ Different data gives different answers to queries.
2/ Caching matters a lot for databases and a different setup will cache 
differently.

This is so true, and it's not even a complete list. It might be better to 
approach the problem from the application layer. Are you able to put together a 
good suite of test data, queries, and updates, accompanied by a good 
understanding of the kinds of load the triplestore will experience in 
production?

Adam Soroka


On Dec 24, 2017, at 1:21 PM, Andy Seaborne  wrote:

On 24/12/17 14:11, Andrew U. Frank wrote:

Thank you for the information; I take it that, using the indexes, a one-variable
query would be (close to) linear in the number of triples found. I saw that TDB
does build indexes and assumed they use hashes.
I still have the following questions:
1. Is performance different for a named graph vs. the default graph?

Query performance is approximately the same for GRAPH.
Update is slower.


2. Can I simplify measurements by putting pieces of the dataset into different
graphs and then including more or fewer of these graphs in each measurement? Say
I have 5 named graphs, each with 10 million triples; do queries over 2, 3, 4 and 5
graphs give the same (or very similar) results as loading 20, 30,
40 and 50 million triples into a single named graph?

Any measurements would be unreliable at best and probably worthless.

1/ Different data gives different answers to queries.

2/ Caching matters a lot for databases and a different setup will cache 
differently.

Andy


thank you for help!
andrew
On 12/23/2017 06:20 AM, ajs6f wrote:

For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on
quads to ensure that all one-variable queries (i.e. triple patterns of the form
?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are
hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node
directory that TDB does.

There are lots of other ways to play that out, according to the balance of
time costs and storage costs desired and the expected types of queries.
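
To make the nested-map idea concrete, here is a rough sketch of one such index (in Python for brevity; TIM itself is Java and indexes Jena Node objects, so the names and data here are illustrative only):

# SPO index: subject -> predicate -> set of objects.
# With a fixed subject and predicate, finding all objects is two hash lookups,
# independent of the total number of triples. A full implementation keeps POS
# and OSP maps as well, so that every one-variable pattern is a direct lookup.
from collections import defaultdict

spo = defaultdict(lambda: defaultdict(set))

def add(s, p, o):
    spo[s][p].add(o)

def objects(s, p):
    """All o such that (s, p, o) is in the store."""
    return spo.get(s, {}).get(p, set())

add("ex:alice", "foaf:knows", "ex:bob")
add("ex:alice", "foaf:knows", "ex:carol")
print(objects("ex:alice", "foaf:knows"))   # {'ex:bob', 'ex:carol'}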

Adam


On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann 
 wrote:


On 23.12.2017 00:47, Andrew U. Frank wrote:

Are there some rules for which queries are linear in the amount of data in
the graph? Is it correct to assume that searching for triples based
on a single condition (?p a X) is logarithmic in the size of the data
collection?

Why should it be logarithmic? The complexity of matching a single BGP
depends on the implementation. I could search for matches by doing a
scan over the whole dataset, which would certainly be linear, not
logarithmic. Usually, if one exists, a triple store would use the POS index
to find bindings for the variable ?p.

Cheers,
Lorenz


--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil



Re: performance measures

2017-12-24 Thread ajs6f
> Any measurements would be unreliable at best and probably worthless.
> 1/ Different data gives different answers to queries.
> 2/ Caching matters a lot for databases and a different setup will cache 
> differently.

This is so true, and it's not even a complete list. It might be better to 
approach the problem from the application layer. Are you able to put together a 
good suite of test data, queries, and updates, accompanied by a good 
understanding of the kinds of load the triplestore will experience in 
production?

Adam Soroka

> On Dec 24, 2017, at 1:21 PM, Andy Seaborne  wrote:
> 
> On 24/12/17 14:11, Andrew U. Frank wrote:
>> Thank you for the information; I take it that, using the indexes, a one-variable 
>> query would be (close to) linear in the number of triples found. I saw that 
>> TDB does build indexes and assumed they use hashes.
>> I still have the following questions:
>> 1. Is performance different for a named graph vs. the default graph?
> 
> Query performance is approximately the same for GRAPH.
> Update is slower.
> 
>> 2. Can I simplify measurements by putting pieces of the dataset into 
>> different graphs and then including more or fewer of these graphs in each 
>> measurement? Say I have 5 named graphs, each with 10 million triples; do queries 
>> over 2, 3, 4 and 5 graphs give the same (or very similar) results as loading 
>> 20, 30, 40 and 50 million triples into a single named graph?
> 
> Any measurements would be unreliable at best and probably worthless.
> 
> 1/ Different data gives different answers to queries.
> 
> 2/ Caching matters a lot for databases and a different setup will cache 
> differently.
> 
>Andy
> 
>> thank you for help!
>> andrew
>> On 12/23/2017 06:20 AM, ajs6f wrote:
>>> For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 
>>> on quads to ensure that all one-variable queries (i.e. triple patterns of the 
>>> form ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The 
>>> indexes are hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the 
>>> kind of node directory that TDB does.
>>> 
>>> There are lots of other ways to play that out, according to the balance of 
>>> time costs and storage costs desired and the expected types of queries.
>>> 
>>> Adam
>>> 
 On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann 
  wrote:
 
 
 On 23.12.2017 00:47, Andrew U. Frank wrote:
> Are there some rules for which queries are linear in the amount of data in
> the graph? Is it correct to assume that searching for triples based
> on a single condition (?p a X) is logarithmic in the size of the data
> collection?
 Why should it be logarithmic? The complexity of matching a single BGP
 depends on the implementation. I could search for matches by doing a
 scan over the whole dataset, which would certainly be linear, not
 logarithmic. Usually, if one exists, a triple store would use the POS index
 to find bindings for the variable ?p.
 
 Cheers,
 Lorenz



Re: performance measures

2017-12-24 Thread Andy Seaborne

On 24/12/17 14:11, Andrew U. Frank wrote:
Thank you for the information; I take it that, using the indexes, a 
one-variable query would be (close to) linear in the number of triples 
found. I saw that TDB does build indexes and assumed they use hashes.


I still have the following questions:

1. Is performance different for a named graph vs. the default graph?


Query performance is approximately the same for GRAPH.
Update is slower.



2. Can I simplify measurements by putting pieces of the dataset into 
different graphs and then including more or fewer of these graphs in each 
measurement? Say I have 5 named graphs, each with 10 million triples, do 
queries over 2, 3, 4 and 5 graphs give the same (or very similar) 
results as loading 20, 30, 40 and 50 million triples into a 
single named graph?


Any measurements would be unreliable at best and probably worthless.

1/ Different data gives different answers to queries.

2/ Caching matters a lot for databases and a different setup will cache 
differently.


Andy



thank you for help!

andrew


On 12/23/2017 06:20 AM, ajs6f wrote:
For example, the TIM in-memory dataset impl uses 3 indexes on triples 
and 6 on quads to ensure that all one-variable queries (i.e. triple 
patterns of the form ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as 
possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, 
Set<Node>>>) and don't use the kind of node directory that TDB does.


There are lots of other ways to play that out, according to the 
balance of time costs and storage costs desired and the expected 
types of queries.


Adam

On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann 
 wrote:



On 23.12.2017 00:47, Andrew U. Frank wrote:

Are there some rules for which queries are linear in the amount of data in
the graph? Is it correct to assume that searching for triples based
on a single condition (?p a X) is logarithmic in the size of the data
collection?

Why should it be logarithmic? The complexity of matching a single BGP
depends on the implementation. I could search for matches by doing a
scan over the whole dataset, which would certainly be linear, not
logarithmic. Usually, if one exists, a triple store would use the POS index
to find bindings for the variable ?p.

Cheers,
Lorenz




Re: performance measures

2017-12-24 Thread Andrew U. Frank
Thank you for the information; I take it that, using the indexes, a 
one-variable query would be (close to) linear in the number of triples 
found. I saw that TDB does build indexes and assumed they use hashes.


I still have the following questions:

1. Is performance different for a named graph vs. the default graph?

2. Can I simplify measurements by putting pieces of the dataset into 
different graphs and then including more or fewer of these graphs in each 
measurement? Say I have 5 named graphs, each with 10 million triples, do 
queries over 2, 3, 4 and 5 graphs give the same (or very similar) 
results as loading 20, 30, 40 and 50 million triples into a 
single named graph?


thank you for help!

andrew


On 12/23/2017 06:20 AM, ajs6f wrote:

For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on 
quads to ensure that all one-variable queries (i.e. triple patterns of the form 
?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are 
hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node 
directory that TDB does.

There are lots of other ways to play that out, according to the balance of 
time costs and storage costs desired and the expected types of queries.

Adam


On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann 
 wrote:


On 23.12.2017 00:47, Andrew U. Frank wrote:

Are there some rules for which queries are linear in the amount of data in
the graph? Is it correct to assume that searching for triples based
on a single condition (?p a X) is logarithmic in the size of the data
collection?

Why should it be logarithmic? The complexity of matching a single BGP
depends on the implementation. I could search for matches by doing a
scan over the whole dataset, which would certainly be linear, not
logarithmic. Usually, if one exists, a triple store would use the POS index
to find bindings for the variable ?p.

Cheers,
Lorenz


--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil