Re: Apache Jena Fuseki - Is there any other way to backup separately when my data is large?

2017-12-14 Thread Conal Tuohy
Try using the "split" command line tool (or similar) to split the nq file
into smaller chunks, then restore each chunk individually. Note that
N-Quads is a line-based format, so it can safely be split at line endings.
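For example, a rough sketch (assuming the file and dataset names from your
mail, a Fuseki on the default port that accepts N-Quads uploads on the
dataset URL, and GNU split; the chunk size is arbitrary):

zcat MYDATANAME.nq.gz | split -l 5000000 - chunk-

for f in chunk-*; do
  curl -X POST --data-binary @"$f" \
       -H 'Content-Type: application/n-quads' \
       'http://localhost:3030/MYDATANAME'
done

Alternatively, each chunk can be loaded offline with tdbloader into the new
server's database directory before starting Fuseki.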



On 13 December 2017 at 11:43, Shengyu Li wrote:

> Hi,
>
> My question is about Apache Jena Fuseki.
>
> Here is my situation.
>
> I backed up about 50G (Fuseki\run\databases\MYDATANAME) data using the
> backup button, and I got the backup from 'Fuseki\run\backups', the file is
> MYDATANAME.nq.gz. It is about 2G.
> [image: Inline image 1]
> When I tried to upload the backup to a new server, the upload was
> interrupted and Fuseki stopped with the error 'java heap space'. So I
> enlarged the heap space in Fuseki\fuseki-server.bat. 50G failed with the
> same error. I kept hitting 'java heap space' until I raised it to 60G
> (my RAM is 64G). Then the upload succeeded.
> [image: Inline image 2]
> My database will keep growing. I am worried that in the future, when I
> back up a larger dataset, I will be unable to upload it with the limited
> RAM. *When making the backup, is there any way to split the backup into
> several small pieces, or to back up with a specific restriction?* (For
> example: the whole dataset contains everyone's data in a country; the
> backup of everyone's data in the country is very large and there may not
> be enough RAM to upload it. So when making the backup, I could use some
> query to restrict it to the people of one province; after backing up each
> province, I would have all the data from my old server.) Or can I use the
> command line to restrict my backup? I didn't find information about this
> in the Fuseki documentation.
>
> Thank you very much!
>
> Sincerely,
> Sherry
>



-- 
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
+61-466-324297


Very very slow query when using a high OFFSET

2017-12-14 Thread Laura Morales
During one of my countless tests... I've set up Fuseki with an HDT store. In
particular, the store is "wikidata.hdt".

Then I ran this query from the Fuseki web UI:

SELECT ?s
WHERE { ?s a  }
LIMIT 10
OFFSET 2000

This query takes forever... so much forever, in fact, that I killed it after 15
minutes with no results. CPU at 100% on *all* threads, the Java VM using all the
allocated RAM (6G), no swap or disk activity.
I don't know where the problem is, especially because I don't know the dynamics
between Fuseki/Jena and the HDT binding (hdt-java).

However:

- hdt-cpp has a small CLI tool that allows matching simple patterns like "? ? 
?" or "? a ?", so I searched for "? a " 
and grepped the output, filtering out the first 20M triples. From when I 
issued the command to when I started to see the first results, about 1 minute 
(and a few seconds) elapsed. *NOTE:* this time also includes the time needed 
to set up the Java VM, map the HDT file into memory, and load some HDT 
indices (in memory).

- hdt-java (the Fuseki binding), instead, has a CLI tool called "hdtsparql" that 
allows running SPARQL queries directly against an HDT file, and AFAICT it uses 
Jena ARQ. This tool also has some initialization time linked to loading the 
HDT; anyway, the query (above) was answered in 35 seconds (load time + query).

So as I said, I don't know what's going on between Fuseki and hdt-java, but 
this looks like a problem with Fuseki. Can somebody else confirm? Any ideas? 
Hints?
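For anyone who wants to reproduce this outside the web UI, the query can be
timed against Fuseki from the command line roughly like this (a sketch; the
dataset name and the class IRI are placeholders, and I assume the default
"sparql" query endpoint):

time curl -G 'http://localhost:3030/wikidata/sparql' \
     --data-urlencode 'query=SELECT ?s WHERE { ?s a <CLASS_IRI> } LIMIT 10 OFFSET 2000' \
     -H 'Accept: application/sparql-results+json'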


Re: Report on loading wikidata

2017-12-14 Thread Laura Morales
> The loaders work on empty databases.

Yes my test is on a new empty dataset. The command that I use is `tdbloader2 
--loc wikidata wikidata.ttl`

> If you are splitting files, and doing partial loads, things are rather 
> different.

No I'm using the whole file. I'd only consider splitting it if there were a way 
to use "FROM " as an alias for "FROM  FROM  
FROM  ..."

> Maybe swappiness is set to keep a %-age of RAM free.

My swappiness is set to 10.
Disk read speed: 2-3MB/s | disk write speed: 40-50MB/s (slowing down over 
time). I think what Dick said is correct; that is, as the index and stored data 
grow, the disk can't keep up. I think a single HDD just doesn't cut it. 
Perhaps an SSD could do it; I don't know because I don't have one. Maybe I 
should try with several hard disks... one to host the 200GB source, one to 
handle data-triples.tmp, one for node2id.net, one for nodes.dat, and so forth...
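(In case it helps anyone reproducing this: a quick, rough way to watch whether
the drive or the JVM is the bottleneck during the load; iostat and vmstat come
from the sysstat/procps packages:)

cat /proc/sys/vm/swappiness   # current swappiness
free -h                       # memory pressure
iostat -x 5                   # per-device utilisation and queue size, every 5s
vmstat 5                      # CPU time vs I/O wait over time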


Re: Report on loading wikidata (errata)

2017-12-14 Thread Andy Seaborne

>> (processing batches of 25K)

The loaders work on empty databases.

tdbloader will load into an existing one, but it does not do anything 
special and you'll get RAM contention.


If you are splitting files, and doing partial loads, things are rather 
different.
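(For concreteness, a rough sketch of that kind of partial load, assuming the
dump is in N-Triples, which splits safely at line boundaries; file names and
chunk size are placeholders:)

split -l 100000000 wikidata.nt part-

tdbloader2 --loc wikidata part-aa   # first part into the empty database
tdbloader  --loc wikidata part-ab   # later parts into the existing database
tdbloader  --loc wikidata part-ac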


>> Right now resident memory is ~3.5GB and virtual ~5.5GB

Maybe swappiness is set to keep a %-age of RAM free.

Andy


On 14/12/17 20:38, dandh988 wrote:

Your IO doesn't know whether it's coming or going!
You're reading from a 250GB file whilst writing to two .tmp files and the 
id-to-node files. Then you are reading data-triples.tmp to sort it, which 
writes to tmp whilst chewing RAM (because it's too big to sort in memory), 
writes the sorted file, and then reads it back whilst writing the index files. 
Repeat three times.
Your HDD heads can only be in one place at a time, and I suspect you've only 
got a maximum of 128MB cache on the drive. The queues on the drive will go 
through the roof, and if the OS decides to page it'll be properly screwed!
SSD can service deep queues because, as an analogy, it can be in more than one 
place at a time.
Stick the 250GB file on a USB drive, as a start, to get that read load off the 
internal IO.
The loader works on HDDs; you just need to be a little smart in understanding 
the limits of the hardware you're using, and laptops are not known for their 
IO chipsets. Even my Dell M3800, which is supposed to be a workstation-grade 
laptop, has one drive and an external SATA connection to help out.


Dick
 Original message From: Laura Morales  Date: 
14/12/2017  20:09  (GMT+00:00) To: jena-users-ml  Subject: Re: 
Report on loading wikidata (errata)
ERRATA:


I don't know why then. Maybe SSD is making all the difference. Try to load it (or 
"latest-all") on a comparable machine using a single SATA disk instead of SSD.


s/SATA/HDD






I loaded 2.2B on a 16G machine which wasn't even server class (i.e. its
I/O path to SSD isn't very quick).


I don't know why then. Maybe SSD is making all the difference. Try to load it (or 
"latest-all") on a comparable machine using a single SATA disk instead of SSD. 
Around 100-150M my computer slows down significantly, and it only gets worse from 
there. All I know is that it's either because of too little RAM, or because the disk can't keep up.


If RAM really is at 1G, even on your small 8G server, that suggests your
setup is configured in the OS to restrict the RAM for mapping. RAM per
process should be > real RAM (remember memory-mapped files are used), or
the VM is set up in some odd way. Or 32-bit Java.


Yeah, sorry, I was looking at shared memory. Right now resident memory is ~3.5GB 
and virtual ~5.5GB. The process started at 150K triples per second; now, after 
250M triples processed, it is at 50K triples/second and still slowing down 
(processing batches of 25K). I don't know what to say; I think the conclusion is 
simply that tdbloader (any version) just doesn't work with large graphs on HDDs. 
So the only solution has to be to use an SSD, or find a way to split the graph 
into smaller stores, or simply give up.

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited



Re: Report on loading wikidata (errata)

2017-12-14 Thread dandh988
Your IO doesn't know whether it's coming or going!
You're reading from a 250GB file whilst writing to two .tmp files and the 
id-to-node files. Then you are reading data-triples.tmp to sort it, which 
writes to tmp whilst chewing RAM (because it's too big to sort in memory), 
writes the sorted file, and then reads it back whilst writing the index files. 
Repeat three times.
Your HDD heads can only be in one place at a time, and I suspect you've only 
got a maximum of 128MB cache on the drive. The queues on the drive will go 
through the roof, and if the OS decides to page it'll be properly screwed!
SSD can service deep queues because, as an analogy, it can be in more than one 
place at a time.
Stick the 250GB file on a USB drive, as a start, to get that read load off the 
internal IO.
The loader works on HDDs; you just need to be a little smart in understanding 
the limits of the hardware you're using, and laptops are not known for their 
IO chipsets. Even my Dell M3800, which is supposed to be a workstation-grade 
laptop, has one drive and an external SATA connection to help out.


Dick
 Original message From: Laura Morales  Date: 
14/12/2017  20:09  (GMT+00:00) To: jena-users-ml  
Subject: Re: Report on loading wikidata (errata) 
ERRATA:

> I don't know why then. Maybe SSD is making all the difference. Try to load it 
> (or "latest-all") on a comparable machine using a single SATA disk instead of 
> SSD.

s/SATA/HDD





> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. it's
> I/O path to SSD isn't very quick).

I don't know why then. Maybe SSD is making all the difference. Try to load it 
(or "latest-all") on a comparable machine using a single SATA disk instead of 
SSD. Around 100-150M my computer slows down significantly, and it only gets 
worse from there. All I know is that it's either because of too little RAM, or 
because the disk can't keep up.

> If RAM really is at 1G , even on your small 8G server, suggests your
> setup is configured in the OS to restrict the RAM for mapping. RAM per
> process should be > real RAM (remember memory mapped files are used) or
> the VM is setup in some odd way. Or 32bit java.

Yeah, sorry, I was looking at shared memory. Right now resident memory is ~3.5GB 
and virtual ~5.5GB. The process started at 150K triples per second; now, after 
250M triples processed, it is at 50K triples/second and still slowing down 
(processing batches of 25K). I don't know what to say; I think the conclusion is 
simply that tdbloader (any version) just doesn't work with large graphs on HDDs. 
So the only solution has to be to use an SSD, or find a way to split the graph 
into smaller stores, or simply give up.

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited


Re: Report on loading wikidata (errata)

2017-12-14 Thread Laura Morales
ERRATA:

> I don't know why then. Maybe SSD is making all the difference. Try to load it 
> (or "latest-all") on a comparable machine using a single SATA disk instead of 
> SSD.

s/SATA/HDD





> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. it's
> I/O path to SSD isn't very quick).

I don't know why then. Maybe SSD is making all the difference. Try to load it 
(or "latest-all") on a comparable machine using a single SATA disk instead of 
SSD. Around 100-150M my computer slows down significantly, and it only gets 
worse from there. All I know is that it's either because of too little RAM, or 
because the disk can't keep up.

> If RAM really is at 1G , even on your small 8G server, suggests your
> setup is configured in the OS to restrict the RAM for mapping. RAM per
> process should be > real RAM (remember memory mapped files are used) or
> the VM is setup in some odd way. Or 32bit java.

Yeah, sorry, I was looking at shared memory. Right now resident memory is ~3.5GB 
and virtual ~5.5GB. The process started at 150K triples per second; now, after 
250M triples processed, it is at 50K triples/second and still slowing down 
(processing batches of 25K). I don't know what to say; I think the conclusion is 
simply that tdbloader (any version) just doesn't work with large graphs on HDDs. 
So the only solution has to be to use an SSD, or find a way to split the graph 
into smaller stores, or simply give up.

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited


Re: TDB multiple locations

2017-12-14 Thread Laura Morales
> Have you seen the dynamic dataset feature?

Yes; however, the way this works is with multiple FROM clauses, for example

SELECT ...
FROM 
FROM 
FROM 
FROM 
...

which is very inconvenient if I've split my graph into multiple stores. For 
instance, wikidata would become

SELECT ...
FROM 
FROM 
FROM 
FROM 
...

What I was saying instead is to define some kind of alias such that I can use

SELECT ... FROM 

and internally Fuseki knows that I'm referring to  + 
 +  +  + ...

this would also be useful when using

SELECT ...
WHERE {
GRAPH  {
}
}

Either this, or a way to declare a graph with multiple locations.


Re: TDB multiple locations

2017-12-14 Thread Andy Seaborne



On 14/12/17 06:19, Laura Morales wrote:

Hi all,

reading the TDB assembler doc [1], I was wondering how easy/difficult it would be to support multiple ja:graph for a ja:namedGraph, 


That's not what ja:graph is for - it is the graph, and ja:namedGraph 
adds the name.


How the graph is formed is down to the part of the assembler that 
ja:graph points to.  I don't know of anything for partial graphs - there 
are some hard parts to that.


If you split a dataset across different storage units, queries will be slower 
because the direct execution path of TDB does not apply.



so that a large graph such as Wikidata could potentially be split into smaller 
graphs.
Alternatively, since SPARQL supports multiple FROM instructions, there could be an option to configure "aliases" so that if 
the user writes "FROM " this will be translated to "FROM  
 ..." where each  is its own ja:namedGraph.

I think this could be equivalent to tdb:unionDefaultGraph, but instead of being 
a union among *all* graphs, only a union among a limited selection of graphs. 
It kinda sounds easy in theory, since it seems like a particular case of 
tdb:unionDefaultGraph, but... I don't know in practice if this would be 
possible?


tdb:unionDefaultGraph is quite specific for union and quite efficient.

Have you seen the dynamic dataset feature?

Arbitrary choices of graph mean tdb:unionDefaultGraph does not apply and 
a more general, and slower, execution happens.
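(For reference, the union default graph is switched on in the TDB assembler 
roughly like this; the location is a placeholder:)

@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true .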


Andy



[1] https://jena.apache.org/documentation/tdb/assembler.html



Re: TDB multiple locations

2017-12-14 Thread Andy Seaborne



On 14/12/17 15:32, ajs6f wrote:

Please start a new thread for a new topic.


Yes - please, a new topic or a JIRA ticket.

Something is wrong (I have checked, and it has been for several versions).
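In the meantime, a possible two-step workaround (an untested sketch, assuming 
the default port, the paths from the original mail, and that the ontology file 
is RDF/XML): start with an updatable in-memory dataset, then load the file 
over the Graph Store Protocol:

./fuseki-server --mem --update /MyFuseki

# from another shell, once the server is up:
curl -X PUT --data-binary @/Ontologies/MyOntology.owl \
     -H 'Content-Type: application/rdf+xml' \
     'http://localhost:3030/MyFuseki/data?default'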

Andy



Did you actually try an update? Did it work?

ajs6f


On Dec 14, 2017, at 10:21 AM, Robert Nielsen  wrote:


Is it possible to start Fuseki with initial contents from a file (an OWL
ontology, not an existing TDB), and then allow updates?

When I start Fuseki with the following parameters:

./fuseki-server --file /Ontologies/MyOntology.owl --update /MyFuseki

I get a message that the resource /MyFuseki is running in read-only
mode. It appears the --update parameter does nothing. But there is no
message that says the parameter is ignored (and why). I can, of course,
start the Fuseki server and then load the ontology file ... but it seems
like I should be able to do it in one step. What am I missing?

Running Apache Jena Fuseki 3.5.0 with Java 1.8.0_144 on Mac OS X 10.12.6
x86_64.

Robert Nielsen




Re: TDB multiple locations

2017-12-14 Thread Laura Morales
> Dick's Mosaic technology might be a better approach or not, depending on your 
> use cases and comfort with the cutting edge.

Mosaic is not merged yet, is it?
SERVICE could work, but it seems like an inelegant hack since I'll have to 
define a new service for every graph that I wish to partition. It still seems 
to me that using a "partial union" would make more sense...


Re: TDB multiple locations

2017-12-14 Thread ajs6f
You can use the SPARQL federated query mechanism for something like this:

https://www.w3.org/TR/sparql11-federated-query/

You would federate over a group of Fuseki instances or a single Fuseki instance 
with multiple TDB datasets loaded. Keep in mind that you will have to pay 
attention to how you shard triples into the different stores to get performant 
queries, but you would have to do that for any way of partitioning the data.
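A rough sketch of that pattern (the endpoint URLs are placeholders for your own 
Fuseki services):

SELECT ?s ?p ?o
WHERE {
  { SERVICE <http://host1:3030/part1/sparql> { ?s ?p ?o } }
  UNION
  { SERVICE <http://host2:3030/part2/sparql> { ?s ?p ?o } }
}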

Dick's Mosaic technology might be a better approach or not, depending on your 
use cases and comfort with the cutting edge.

ajs6f

> On Dec 14, 2017, at 1:19 AM, Laura Morales  wrote:
> 
> Hi all,
> 
> reading the TDB assembler doc [1], I was wondering how easy/difficult it 
> would be to support multiple ja:graph for a ja:namedGraph, so that a large 
> graph such as Wikidata could potentially be split into smaller graphs.
> Alternatively, since SPARQL supports multiple FROM instructions, there could 
> be an option to configure "aliases" so that if the user writes "FROM 
> " this will be translated to "FROM  
>  ..." where each  is its own ja:namedGraph.
> 
> I think this could be equivalent to tdb:unionDefaultGraph, but instead of 
> being a union among *all* graphs, only a union among a limited selection of 
> graphs. It kinda sounds easy in theory, since it seems like a particular case 
> of tdb:unionDefaultGraph, but... I don't know in practice if this would be 
> possible?
> 
> [1] https://jena.apache.org/documentation/tdb/assembler.html



Re: TDB multiple locations

2017-12-14 Thread Andrew U. Frank
You can achieve this with a file in your Fuseki run directory, 
../run/configuration. I found an example on the web which I used,


from 
https://github.com/jfmunozf/Jena-Fuseki-Reasoner-Inference/wiki/Configuring-Apache-Jena-Fuseki-2.4.1-inference-and-reasoning-support-using-SPARQL-1.1:-Jena-inference-rules,-RDFS-Entailment-Regimes-and-OWL-reasoning


Adapt the file names to your names!

@prefix :       <#> .
@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

#This line is a comment

:service1    a    fuseki:Service ;
    fuseki:dataset    :dataset ;
    fuseki:name   "ElQuijote" ;
    fuseki:serviceQuery   "query" , "sparql" ;
    fuseki:serviceReadGraphStore  "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceUpdate  "update" ;
    fuseki:serviceUpload  "upload" .

:dataset rdf:type ja:RDFDataset ;
    rdfs:label "ElQuijote" ;
    ja:defaultGraph
      [ rdfs:label "ElQuijote" ;
        a ja:InfModel ;

        #Reference to model.ttl file
        ja:content [ja:externalContent <.../rdfsOntologyExample/model.ttl> ] ;

        #Reference to data.ttl file
        ja:content [ja:externalContent <.../rdfsOntologyExample/data.ttl> ] ;

        #OWL-based reasoner
        ja:reasoner [ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner> ] ;

        #RDFS-based reasoner (disabled)
#       ja:reasoner [ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner> ] ;

        #Jena rules-based reasoner, pointing at the myrules.rules file (disabled)
#       ja:reasoner [
#           ja:reasonerURL <http://jena.hpl.hp.com/2003/GenericRuleReasoner> ;
#           ja:rulesFrom <.../myrules.rules> ;
#       ] ;
      ] ;
    .



On 12/14/2017 10:21 AM, Robert Nielsen wrote:

Is it possible to start Fuseki with initial contents from a file (an OWL
ontology, not an existing TDB), and then allow updates?

When I start Fuseki with the following parameters:

./fuseki-server --file /Ontologies/MyOntology.owl --update /MyFuseki

I get a message that the resource /MyFuseki is running in read-only
mode. It appears the --update parameter does nothing. But there is no
message that says the parameter is ignored (and why). I can, of course,
start the Fuseki server and then load the ontology file ... but it seems
like I should be able to do it in one step. What am I missing?

Running Apache Jena Fuseki 3.5.0 with Java 1.8.0_144 on Mac OS X 10.12.6
x86_64.

Robert Nielsen



--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil



Re: TDB multiple locations

2017-12-14 Thread ajs6f
Please start a new thread for a new topic.

Did you actually try an update? Did it work?

ajs6f

> On Dec 14, 2017, at 10:21 AM, Robert Nielsen  wrote:
> 
> 
> Is it possible to start Fuseki with initial contents from a file (an OWL
> ontology, not an existing TDB), and then allow updates?
> 
> When I start Fuseki with the following parameters:
> 
>   ./fuseki-server --file /Ontologies/MyOntology.owl --update /MyFuseki
> 
> I get a message that the the resource /MyFuseki is running in read-only
> mode.   It appears the --update parameter does nothing.   But there is no
> message that says the parameter is ignored (and why).   I can, of course,
> start the Fuseki server and then load the ontology file ... but it seems
> like I should be able to do it in one step.   What am I missing?
> 
> Running Apache Jena Fuseki 3.5.0 with Java 1.8.0_144 on Mac OS X 10.12.6
> x86_64.
> 
> Robert Nielsen



Re: TDB multiple locations

2017-12-14 Thread Robert Nielsen

Is it possible to start Fuseki with initial contents from a file (an OWL
ontology, not an existing TDB), and then allow updates?

When I start Fuseki with the following parameters:

./fuseki-server --file /Ontologies/MyOntology.owl --update /MyFuseki

I get a message that the resource /MyFuseki is running in read-only
mode. It appears the --update parameter does nothing. But there is no
message that says the parameter is ignored (and why). I can, of course,
start the Fuseki server and then load the ontology file ... but it seems
like I should be able to do it in one step. What am I missing?

Running Apache Jena Fuseki 3.5.0 with Java 1.8.0_144 on Mac OS X 10.12.6
x86_64.

Robert Nielsen