Re: Problem with MAX when no result expected

2017-10-07 Thread George News
Hi,

Your answer made me realize I committed an error while typing the question.
I have just sent a corrected one.

But from the answer below you confirm that the current Jena output is
the desired behaviour. I still don't understand why an aggregate such as
MAX or MIN returns one row in the result. I will have to accept it ;)
although I consider that the MAX of nothing is nothing, and therefore
there shouldn't be a row.

Then, is there any way I can check whether the result consists only of an
empty row, without advancing the cursor of the ResultSet? I don't want to
use the ResultSetRewindable class, as I understand it copies all the
results into memory in order to allow going back to the initial
position.
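
For reference, ARQ can wrap a ResultSet so that the next row can be
inspected without consuming it. A minimal sketch, assuming Jena 3.x (the
query is illustrative): ResultSetFactory.makePeekable returns a
ResultSetPeekable whose peek() looks at the next row without advancing
the cursor.

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.ModelFactory;

public class PeekExample {
    public static void main(String[] args) {
        // Illustrative aggregate query over an empty model.
        String q = "SELECT (MAX(?x) AS ?M) { FILTER(false) }";
        try (QueryExecution qe =
                 QueryExecutionFactory.create(q, ModelFactory.createDefaultModel())) {
            ResultSetPeekable rs = ResultSetFactory.makePeekable(qe.execSelect());
            if (rs.hasNext()) {
                QuerySolution row = rs.peek();   // look ahead; cursor not moved
                // An aggregate over no matches yields one row with no bound variables.
                boolean allUnbound = !row.varNames().hasNext();
                System.out.println("row present but empty: " + allUnbound);
            }
        }
    }
}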

Thank you very much for the help, and sorry for bothering you so much.

Regards,
Jorge


On 2017-10-08 01:01, Andy Seaborne wrote:
> If there is an aggregation, you will get one row.
> 
> SELECT (MAX(?x) AS ?M)
> { FILTER(false) }
> 
> ==>
> (sparql --query Q.rq)
> -----
> | M |
> =====
> |   |
> -----
> which is:
> (sparql --query Q.rq --results json)
> {
>   "head": {
>     "vars": [ "M" ]
>   } ,
>   "results": {
>     "bindings": [
>   {
> 
>   }
>     ]
>   }
> }
> 
> and no aggregation:
> 
> SELECT ?x
> { FILTER(false) }
> ==>
> -----
> | x |
> =====
> -----
> which is:
> {
>   "head": {
>     "vars": [ "x" ]
>   } ,
>   "results": {
>     "bindings": [
> 
>     ]
>   }
> }
> 
> 
> Aggregation: no rows in the WHERE, one row in the result
> 
> No aggregation, no rows in WHERE, no rows in the result.
> 
> The details are inconsistent - see below.
> 
> On 07/10/17 23:15, George News wrote:
>> Hi Andy,
>>
>> Now I understand the misunderstanding between you and me. The responses
>> I included in my original mail were wrong :( Please accept my
>> apologies.
>>
>> These are the right query/responses:
>>
>> # Case 1)
>> select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
>> where {
>> ..
>> }
>> --
>> {
>>  "head": {
>>  "vars": [
>>  "id", "time", "value", "latitude", "longitude"
>>  ]
>>  },
>>  "results": {
>>  "bindings": [
>>  {}
> 
> Not ARQ output. See above.
> 
>>  ]
>>  }
>> }
>>
>> # Case 2)
>> select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
>> where {
>> ..
>> }
>> --
>> {
>>  "head": {
>>  "vars": [
>>  "id", "value", "latitude", "longitude"
> 
> There is no ?time listed
> 
> Your case two is no aggregation - in which case you get no rows and no
> ?time.
> 
> Now, if you had
> select ?id  ?value ?latitude ?longitude
> 
> and no match in the WHERE, then things are correct.
> 
>>  ]
>>  },
>>  "results": {
>>  "bindings": [
>>  ]
>>  }
>> }
>>
>> Now you can see the difference I was noticing. In the first case it is
>> an empty array (resultset.hasNext() -> false) and the second is an array
>> with an empty object (resultset.hasNext() -> true).
> 
> In the first the array has one item.
> In the second the array has no items.
> 
>> Why is this the behaviour? I hope you now understand the issue, which in
>> my opinion is kind of a bug.
> 
> Please provide a complete, verifiable, minimal example.
> 
>>
>> Regards,
>> Jorge
>>
>>
>>
>>
>>
>> On 2017-10-06 16:11, Andy Seaborne wrote:
>>>
>>>
>>> On 06/10/17 12:26, George News wrote:
 On 2017-10-06 11:25, Andy Seaborne wrote:
> The two result sets you show both have one row, with bindings. That's
> consistent with aggregation of nothing (no groups, or if no GROUP
> BY, no
> results from the WHERE pattern).

 I don't see it the same way. The first one (without max) is an empty
 array, while the second (with max) has an array with one object
 (empty).
>>>
>>>  "results": {
>>>  "bindings": [
>>>  {}
>>>  ]
>>>  }
>>>
>>> both times.
>>>
>>> An array of rows, a row is {} i.e. no keys, no variables.
>>>
>>> But the query isn't legal so I don't know what is actually happening.
>>>

>
> MAX() of nothing is unbound but for any aggregation, there always is
> a row.
>
> c.f. COUNT(*) is 0 when there are no solutions.
>
> It's just that MAX(...) can't return a "there isn't anything" value
>
>   Andy
>

 I see your point, but this gives a wrong idea of the result set, as it
 really is empty. If I don't get any time, I cannot calculate the max of
 nothing. In principle this is what Jena is returning, as the object is
 empty, but there should be a way to not get this empty object within the
 array of bindings.

 Is there any way I can check the resultset pointer to get the next()
 value without moving the pointer? I need to know in advance whether there
 are any results, before retrieving them all.


>
> On 06/10/17 10:15, George News wrote:
>> Hi all,
>>
>> I am executing a SPARQL query with the MAX aggregate function and I'm facing a
>> strange behaviour, 

Re: Problem with MAX when no result expected

2017-10-07 Thread George News
Hi,

Forget the last one. I've just realized that I again included a mistake;
this is the good one (I hope ;))

# Case 1)
select ?id ?value ?latitude ?longitude
where {
..
}
--
{
"head": {
"vars": [
"id", "value", "latitude", "longitude"
]
},
"results": {
"bindings": [
]
}
}

# Case 2)
select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}
--
{
"head": {
"vars": [
"id", "time", "value", "latitude", "longitude"
]
},
"results": {
"bindings": [
{}
]
}
}

Now you can see the difference I was noticing. In the first case
bindings is an empty array (resultset.hasNext() -> false) and the second
is an array with an empty object (resultset.hasNext() -> true).

Why is this the behaviour? I hope you now understand the issue, which in my
opinion is kind of a bug.

Regards,
Jorge

On 2017-10-08 00:15, George News wrote:
> Hi Andy,
> 
> Now I understand the misunderstanding between you and me. The responses
> I included in my original mail were wrong :( Please accept my apologies.
> 
> These are the right query/responses:
> 
> # Case 1)
> select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
> where {
> ..
> }
> --
> {
> "head": {
> "vars": [
> "id", "time", "value", "latitude", "longitude"
> ]
> },
> "results": {
> "bindings": [
> {}
> ]
> }
> }
> 
> # Case 2)
> select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
> where {
> ..
> }
> --
> {
> "head": {
> "vars": [
> "id", "value", "latitude", "longitude"
> ]
> },
> "results": {
> "bindings": [
> ]
> }
> }
> 
> Now you can see the difference I was noticing. In the first case it is
> an empty array (resultset.hasNext() -> false) and the second is an array
> with an empty object (resultset.hasNext() -> true).
> 
> Why is this the behaviour? I hope you now understand the issue, which in my
> opinion is kind of a bug.
> 
> Regards,
> Jorge
> 
> 
> 
> 
> 
> On 2017-10-06 16:11, Andy Seaborne wrote:
>>
>>
>> On 06/10/17 12:26, George News wrote:
>>> On 2017-10-06 11:25, Andy Seaborne wrote:
 The two result sets you show both have one row, with bindings. That's
 consistent with aggregation of nothing (no groups, or if no GROUP BY, no
 results from the WHERE pattern).
>>>
>>> I don't see it the same way. The first one (without max) is an empty
>>> array, while the second (with max) has an array with one object (empty).
>>
>>     "results": {
>>     "bindings": [
>>     {}
>>     ]
>>     }
>>
>> both times.
>>
>> An array of rows, a row is {} i.e. no keys, no variables.
>>
>> But the query isn't legal so I don't know what is actually happening.
>>
>>>

 MAX() of nothing is unbound but for any aggregation, there always is
 a row.

 c.f. COUNT(*) is 0 when there are no solutions.

 It's just that MAX(...) can't return a "there isn't anything" value

  Andy

>>>
>>> I see your point, but this gives a wrong idea of the result set, as it
>>> really is empty. If I don't get any time, I cannot calculate the max of
>>> nothing. In principle this is what Jena is returning, as the object is
>>> empty, but there should be a way to not get this empty object within the
>>> array of bindings.
>>>
>>> Is there any way I can check the resultset pointer to get the next()
>>> value without moving the pointer? I need to know in advance whether there
>>> are any results, before retrieving them all.
>>>
>>>

 On 06/10/17 10:15, George News wrote:
> Hi all,
>
> I am executing a SPARQL query with the MAX aggregate function and I'm facing a
> strange behaviour, or at least I think it is.
>
> The snippet of the select variables is the following:
>
> select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
> where {
> ..
> }

>
> If I launch the SPARQL query and there are results matching there is no
> problem and I get the expected answer.
>
> However if I launch the same query over another database and there
> should be no match I get the following:
>
> {
>   "head": {
>   "vars": [
>   "id", "time", "value", "latitude", "longitude"
>   ]
>   },
>   "results": {
>   "bindings": [
>   {}
>   ]
>   }
> }
>
> As you can see, although the resultset seems to be empty, it is not: it
> is returning one empty object. Indeed, checking resultset.hasNext()
> within the code returns true.
>
> If I remove the MAX function from the variables everything is ok,
> and no
> empty object shows up.
>
> select ?id ?value ?latitude ?longitude
> where {
> ..
> }
> --
> {
>   

Re: Problem with MAX when no result expected

2017-10-07 Thread Andy Seaborne

If there is an aggregation, you will get one row.

SELECT (MAX(?x) AS ?M)
{ FILTER(false) }

==>
(sparql --query Q.rq)
-----
| M |
=====
|   |
-----
which is:
(sparql --query Q.rq --results json)
{
  "head": {
"vars": [ "M" ]
  } ,
  "results": {
"bindings": [
  {

  }
]
  }
}

and no aggregation:

SELECT ?x
{ FILTER(false) }
==>
-----
| x |
=====
-----
which is:
{
  "head": {
"vars": [ "x" ]
  } ,
  "results": {
"bindings": [

]
  }
}


Aggregation: no rows in the WHERE, one row in the result.

No aggregation: no rows in the WHERE, no rows in the result.

The details are inconsistent - see below.
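
To make the distinction concrete, here is a minimal sketch (assuming Jena
3.x; the queries are the two shown above) that runs both against an empty
in-memory model and prints whether a row comes back:

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class AggregateRowDemo {
    static boolean hasRow(String query, Model m) {
        try (QueryExecution qe = QueryExecutionFactory.create(query, m)) {
            return qe.execSelect().hasNext();
        }
    }

    public static void main(String[] args) {
        Model empty = ModelFactory.createDefaultModel();
        // Aggregation: one row, even over no matches (all variables unbound).
        System.out.println(hasRow("SELECT (MAX(?x) AS ?M) { FILTER(false) }", empty)); // true
        // No aggregation: no rows at all.
        System.out.println(hasRow("SELECT ?x { FILTER(false) }", empty));              // false
    }
}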

On 07/10/17 23:15, George News wrote:

Hi Andy,

Now I understand the misunderstanding between you and me. The responses
I included in my original mail were wrong :( Please accept my apologies.

These are the right query/responses:

# Case 1)
select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}
--
{
 "head": {
 "vars": [
 "id", "time", "value", "latitude", "longitude"
 ]
 },
 "results": {
 "bindings": [
 {}


Not ARQ output. See above.


 ]
 }
}

# Case 2)
select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}
--
{
 "head": {
 "vars": [
 "id", "value", "latitude", "longitude"


There is no ?time listed

Your case two is no aggregation - in which case you get no rows and no 
?time.


Now, if you had
select ?id  ?value ?latitude ?longitude

and no match in the WHERE, then things are correct.


 ]
 },
 "results": {
 "bindings": [
 ]
 }
}

Now you can see the difference I was noticing. In the first case it is
an empty array (resultset.hasNext() -> false) and the second is an array
with an empty object (resultset.hasNext() -> true).


In the first the array has one item.
In the second the array has no items.


Why is this the behaviour? I hope you now understand the issue, which in my
opinion is kind of a bug.


Please provide a complete, verifiable, minimal example.



Regards,
Jorge





On 2017-10-06 16:11, Andy Seaborne wrote:



On 06/10/17 12:26, George News wrote:

On 2017-10-06 11:25, Andy Seaborne wrote:

The two result sets you show both have one row, with bindings. That's
consistent with aggregation of nothing (no groups, or if no GROUP BY, no
results from the WHERE pattern).


I don't see it the same way. The first one (without max) is an empty
array, while the second (with max) has an array with one object (empty).


     "results": {
     "bindings": [
     {}
     ]
     }

both times.

An array of rows, a row is {} i.e. no keys, no variables.

But the query isn't legal so I don't know what is actually happening.





MAX() of nothing is unbound but for any aggregation, there always is
a row.

c.f. COUNT(*) is 0 when there are no solutions.

It's just that MAX(...) can't return a "there isn't anything" value

  Andy



I see your point, but this gives a wrong idea of the result set, as it
really is empty. If I don't get any time, I cannot calculate the max of
nothing. In principle this is what Jena is returning, as the object is
empty, but there should be a way to not get this empty object within the
array of bindings.

Is there any way I can check the resultset pointer to get the next()
value without moving the pointer? I need to know in advance whether there
are any results, before retrieving them all.




On 06/10/17 10:15, George News wrote:

Hi all,

I am executing a SPARQL query with the MAX aggregate function and I'm facing a
strange behaviour, or at least I think it is.

The snippet of the select variables is the following:

select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}




If I launch the SPARQL query and there are results matching there is no
problem and I get the expected answer.

However if I launch the same query over another database and there
should be no match I get the following:

{
   "head": {
   "vars": [
   "id", "time", "value", "latitude", "longitude"
   ]
   },
   "results": {
   "bindings": [
   {}
   ]
   }
}

As you can see, although the resultset seems to be empty, it is not: it
is returning one empty object. Indeed, checking resultset.hasNext()
within the code returns true.

If I remove the MAX function from the variables everything is ok,
and no
empty object shows up.

select ?id ?value ?latitude ?longitude
where {
..
}
--
{
   "head": {
   "vars": [
   "id", "value", "latitude", "longitude"
   ]
   },
   "results": {
   "bindings": [
   {}
   ]
   }
}

Why is that happening? Is this the expected behaviour? I guess it
shouldn't be. When you use the COUNT function it returns 0, but MIN/MAX/etc.
are different functions, and if there is no result nothing should appear.

Any help/tip is more 

Re: Problem with MAX when no result expected

2017-10-07 Thread George News
Hi Andy,

Now I understand the misunderstanding between you and me. The responses
I included in my original mail were wrong :( Please accept my apologies.

These are the right query/responses:

# Case 1)
select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}
--
{
"head": {
"vars": [
"id", "time", "value", "latitude", "longitude"
]
},
"results": {
"bindings": [
{}
]
}
}

# Case 2)
select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
where {
..
}
--
{
"head": {
"vars": [
"id", "value", "latitude", "longitude"
]
},
"results": {
"bindings": [
]
}
}

Now you can see the difference I was noticing. In the first case it is
an empty array (resultset.hasNext() -> false) and the second is an array
with an empty object (resultset.hasNext() -> true).

Why is this the behaviour? I hope you now understand the issue, which in my
opinion is kind of a bug.

Regards,
Jorge





On 2017-10-06 16:11, Andy Seaborne wrote:
> 
> 
> On 06/10/17 12:26, George News wrote:
>> On 2017-10-06 11:25, Andy Seaborne wrote:
>>> The two result sets you show both have one row, with bindings. That's
>>> consistent with aggregation of nothing (no groups, or if no GROUP BY, no
>>> results from the WHERE pattern).
>>
>> I don't see it the same way. The first one (without max) is an empty
>> array, while the second (with max) has an array with one object (empty).
> 
>     "results": {
>     "bindings": [
>     {}
>     ]
>     }
> 
> both times.
> 
> An array of rows, a row is {} i.e. no keys, no variables.
> 
> But the query isn't legal so I don't know what is actually happening.
> 
>>
>>>
>>> MAX() of nothing is unbound but for any aggregation, there always is
>>> a row.
>>>
>>> c.f. COUNT(*) is 0 when there are no solutions.
>>>
>>> It's just that MAX(...) can't return a "there isn't anything" value
>>>
>>>  Andy
>>>
>>
>> I see your point, but this gives a wrong idea of the result set, as it
>> really is empty. If I don't get any time, I cannot calculate the max of
>> nothing. In principle this is what Jena is returning, as the object is
>> empty, but there should be a way to not get this empty object within the
>> array of bindings.
>>
>> Is there any way I can check the resultset pointer to get the next()
>> value without moving the pointer? I need to know in advance whether there
>> are any results, before retrieving them all.
>>
>>
>>>
>>> On 06/10/17 10:15, George News wrote:
 Hi all,

 I am executing a SPARQL query with the MAX aggregate function and I'm facing a
 strange behaviour, or at least I think it is.

 The snippet of the select variables is the following:

 select ?id (MAX(?ti) as ?time) ?value ?latitude ?longitude
 where {
 ..
 }
>>>

 If I launch the SPARQL query and there are results matching there is no
 problem and I get the expected answer.

 However if I launch the same query over another database and there
 should be no match I get the following:

 {
   "head": {
   "vars": [
   "id", "time", "value", "latitude", "longitude"
   ]
   },
   "results": {
   "bindings": [
   {}
   ]
   }
 }

 As you can see, although the resultset seems to be empty, it is not: it
 is returning one empty object. Indeed, checking resultset.hasNext()
 within the code returns true.

 If I remove the MAX function from the variables everything is ok,
 and no
 empty object shows up.

 select ?id ?value ?latitude ?longitude
 where {
 ..
 }
 --
 {
   "head": {
   "vars": [
   "id", "value", "latitude", "longitude"
   ]
   },
   "results": {
   "bindings": [
   {}
   ]
   }
 }

 Why is that happening? Is this the expected behaviour? I guess it
 shouldn't be. When you use the COUNT function it returns 0, but MIN/MAX/etc.
 are different functions, and if there is no result nothing should appear.

 Any help/tip is more than welcome.

 Regards,
 Jorge





>>>
> 


Re: loading many small rdf/xml files

2017-10-07 Thread Martynas Jusevičius
RDF/XML was the first RDF syntax.

On Sat, 7 Oct 2017 at 20.27, Andrew U. Frank 
wrote:

> thank you again!
>
> rereading your answers, i checked on the utilities xargs and riot, which
> i had not ever used before. then i understood your approach (thank you
> for putting the command line in!) and followed your approach. it indeed
> produces lots of warnings and i had also a hard error in the riot
> output, which i could fix with rapper. then it loaded
>
> still: why would project gutenberg select such a format?
>
> andrew
>
>
>
>
>
> On 10/07/2017 12:52 PM, Andy Seaborne wrote:
> >
> >
> > On 07/10/17 17:06, Andrew U. Frank wrote:
> >> thank you - your link indicates why the solution with calling s-put
> >> for each individual file is so slow.
> >>
> >> practically - i will just wait the 10 hours and then extract the
> >> triples from the store.
> >
> > I admire your patience!
> >
> > I've just downloaded the RDF, converted it to N-triples and loaded it
> > into TDB. 55688 files converted to N-triples : 7,949,706 triples.
> >
> > date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date
> >
> > (Load time was 83s / disk is an SSD)
> >
> > Then I loaded it into Fuseki into a different, empty database and it
> > took ~82 seconds (java had already started).
> >
> > There are a few RDF warnings:
> >
> > It uses mixed case host names sometimes:
> >   http://fr.Wikipedia.org
> >
> > Some literals are in non-canonical UTF-8:
> >   "String not in Unicode Normal Form C"
> >
> > Doesn't stop the process - they are only warnings.
> >
> > Andy
> >
> >> can you understand, why somebody would select this format? what is
> >> the advantage?
> >>
> >> andrew
> >>
> >>
> >>
> >> On 10/07/2017 10:52 AM, zPlus wrote:
> >>> Hello Andrew,
> >>>
> >>> if I understand this correctly, I think I stumbled on the same problem
> >>> before. Concatenating XML files will not work indeed. My solution was
> >>> to convert all XML files to N-Triples, then concatenate all those
> >>> triples into a single file, and finally load only this file.
> >>> Ultimately, what I ended up with is this loop [1]. The idea is to call
> >>> RIOT with a list of files as input, instead of calling RIOT on every
> >>> file.
> >>>
> >>> I hope this helps.
> >>>
> >>> 
> >>> [1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54
> >>>
> >>> - Original Message -
> >>> From: users@jena.apache.org
> >>> To:"users@jena.apache.org" 
> >>> Cc:
> >>> Sent:Sat, 7 Oct 2017 10:17:18 -0400
> >>> Subject:loading many small rdf/xml files
> >>>
> >>>   i have to load the Gutenberg projects catalog in rdf/xml format. this
> >>> is
> >>>   a collection of about 50,000 files, each containing a single record
> >>> as
> >>>   attached.
> >>>
> >>>   if i try to concatenate these files into a single one the result is
> >>> not
> >>>   legal rdf/xml - there are xml doc headers:
> >>>
> >>>   <rdf:RDF ... xml:base="http://www.gutenberg.org/">
> >>>
> >>>   and similar, which can only occur once per file.
> >>>
> >>>   i found a way to load each file individually with s-put and a loop,
> >>> but
> >>>   this runs extremely slowly - it is already running for more than 10
> >>>   hours; each file takes half a second to load (fuseki running as
> >>> localhost).
> >>>
> >>>   i am sure there is a better way?
> >>>
> >>>   thank you for the help!
> >>>
> >>>   andrew
> >>>
> >>>   --
> >>>   em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
> >>>   +43 1 58801 12710 direct
> >>>   Geoinformation, TU Wien +43 1 58801 12700 office
> >>>   Gusshausstr. 27-29 +43 1 55801 12799 fax
> >>>   1040 Wien Austria +43 676 419 25 72 mobil
> >>>
> >>>
> >>>
> >>
>
> --
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>   +43 1 58801 12710 direct
> Geoinformation, TU Wien  +43 1 58801 12700 office
> Gusshausstr. 27-29   +43 1 55801 12799 fax
> 1040 Wien Austria+43 676 419 25 72 mobil
>
>


Re: loading many small rdf/xml files

2017-10-07 Thread Andrew U. Frank

thank you again!

rereading your answers, i checked on the utilities xargs and riot, which 
i had not ever used before. then i understood your approach (thank you 
for putting the command line in!) and followed your approach. it indeed 
produces lots of warnings and i had also a hard error in the riot 
output, which i could fix with rapper. then it loaded


still: why would project gutenberg select such a format?

andrew





On 10/07/2017 12:52 PM, Andy Seaborne wrote:



On 07/10/17 17:06, Andrew U. Frank wrote:
thank you - your link indicates why the solution with calling s-put 
for each individual file is so slow.


practically - i will just wait the 10 hours and then extract the 
triples from the store.


I admire your patience!

I've just downloaded the RDF, converted it to N-triples and loaded it 
into TDB. 55688 files converted to N-triples : 7,949,706 triples.


date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date

(Load time was 83s / disk is an SSD)

Then I loaded it into Fuseki into a different, empty database and it 
took ~82 seconds (java had already started).


There are a few RDF warnings:

It uses mixed case host names sometimes:
  http://fr.Wikipedia.org

Some literals are in non-canonical UTF-8:
  "String not in Unicode Normal Form C"

Doesn't stop the process - they are only warnings.

    Andy

can you understand, why somebody would select this format? what is 
the advantage?


andrew



On 10/07/2017 10:52 AM, zPlus wrote:

Hello Andrew,

if I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will not work indeed. My solution was
to convert all XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
file.

I hope this helps.


[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54

- Original Message -
From: users@jena.apache.org
To:"users@jena.apache.org" 
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

  i have to load the Gutenberg projects catalog in rdf/xml format. this
is
  a collection of about 50,000 files, each containing a single record
as
  attached.

  if i try to concatenate these files into a single one the result is
not
  legal rdf/xml - there are xml doc headers:

  <rdf:RDF ... xml:base="http://www.gutenberg.org/">

  and similar, which can only occur once per file.

  i found a way to load each file individually with s-put and a loop,
but
  this runs extremely slowly - it is already running for more than 10
  hours; each file takes half a second to load (fuseki running as
localhost).

  i am sure there is a better way?

  thank you for the help!

  andrew

  --
  em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
  +43 1 58801 12710 direct
  Geoinformation, TU Wien +43 1 58801 12700 office
  Gusshausstr. 27-29 +43 1 55801 12799 fax
  1040 Wien Austria +43 676 419 25 72 mobil







--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil



Re: loading many small rdf/xml files

2017-10-07 Thread Andy Seaborne



On 07/10/17 17:06, Andrew U. Frank wrote:
thank you - your link indicates why the solution with calling s-put for 
each individual file is so slow.


practically - i will just wait the 10 hours and then extract the triples 
from the store.


I admire your patience!

I've just downloaded the RDF, converted it to N-triples and loaded it 
into TDB. 55688 files converted to N-triples : 7,949,706 triples.


date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date

(Load time was 83s / disk is an SSD)

Then I loaded it into Fuseki into a different, empty database and it 
took ~82 seconds (java had already started).


There are a few RDF warnings:

It uses mixed case host names sometimes:
  http://fr.Wikipedia.org

Some literals are in non-canonical UTF-8:
  "String not in Unicode Normal Form C"

Doesn't stop the process - they are only warnings.

Andy

can you understand, why somebody would select this format? what is the 
advantage?


andrew



On 10/07/2017 10:52 AM, zPlus wrote:

Hello Andrew,

if I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will not work indeed. My solution was
to convert all XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
file.

I hope this helps.


[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54

- Original Message -
From: users@jena.apache.org
To:"users@jena.apache.org" 
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

  i have to load the Gutenberg projects catalog in rdf/xml format. this
is
  a collection of about 50,000 files, each containing a single record
as
  attached.

  if i try to concatenate these files into a single one the result is
not
  legal rdf/xml - there are xml doc headers:

  <rdf:RDF ... xml:base="http://www.gutenberg.org/">

  and similar, which can only occur once per file.

  i found a way to load each file individually with s-put and a loop,
but
  this runs extremely slowly - it is already running for more than 10
  hours; each file takes half a second to load (fuseki running as
localhost).

  i am sure there is a better way?

  thank you for the help!

  andrew

  --
  em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
  +43 1 58801 12710 direct
  Geoinformation, TU Wien +43 1 58801 12700 office
  Gusshausstr. 27-29 +43 1 55801 12799 fax
  1040 Wien Austria +43 676 419 25 72 mobil







Why RDF/XML? Was: loading many small rdf/xml files

2017-10-07 Thread ajs6f

Simply because it is both XML and RDF.

There is an enormous installed base of expertise and tooling for XML. It's often worth taking advantage of, even if it 
is technically unperformant on a case-by-case basis. If you have to process RDF and you already know a great deal about 
XML and use languages like XSLT or XQuery, reusing them for RDF is very attractive.


Historically, there was an idea of a unified layered architecture to the semantic web activity. I think this Wikipedia 
page: https://en.wikipedia.org/wiki/Semantic_Web_Stack is old enough to portray that idea. I'm not sure anyone now would 
be willing to argue that XML sits under RDF as a syntax layer. (Think about the evolution of JSON and JSON-LD, not shown 
at all on that picture.)



ajs6f

Andrew U. Frank wrote on 10/7/17 12:06 PM:

thank you - your link indicates why the solution with calling s-put for each 
individual file is so slow.

practically - i will just wait the 10 hours and then extract the triples from 
the store.

can you understand, why somebody would select this format? what is the 
advantage?

andrew



On 10/07/2017 10:52 AM, zPlus wrote:

Hello Andrew,

if I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will not work indeed. My solution was
to convert all XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
file.

I hope this helps.


[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54

- Original Message -
From: users@jena.apache.org
To:"users@jena.apache.org" 
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

  i have to load the Gutenberg projects catalog in rdf/xml format. this
is
  a collection of about 50,000 files, each containing a single record
as
  attached.

  if i try to concatenate these files into a single one the result is
not
  legal rdf/xml - there are xml doc headers:

  <rdf:RDF ... xml:base="http://www.gutenberg.org/">

  and similar, which can only occur once per file.

  i found a way to load each file individually with s-put and a loop,
but
  this runs extremely slowly - it is already running for more than 10
  hours; each file takes half a second to load (fuseki running as
localhost).

  i am sure there is a better way?

  thank you for the help!

  andrew

  --
  em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
  +43 1 58801 12710 direct
  Geoinformation, TU Wien +43 1 58801 12700 office
  Gusshausstr. 27-29 +43 1 55801 12799 fax
  1040 Wien Austria +43 676 419 25 72 mobil







Re: loading many small rdf/xml files

2017-10-07 Thread Andrew U. Frank
thank you - your link indicates why the solution with calling s-put for 
each individual file is so slow.


practically - i will just wait the 10 hours and then extract the triples 
from the store.


can you understand, why somebody would select this format? what is the 
advantage?


andrew



On 10/07/2017 10:52 AM, zPlus wrote:

Hello Andrew,

if I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will not work indeed. My solution was
to convert all XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
file.

I hope this helps.


[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54

- Original Message -
From: users@jena.apache.org
To:"users@jena.apache.org" 
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

  i have to load the Gutenberg projects catalog in rdf/xml format. this
is
  a collection of about 50,000 files, each containing a single record
as
  attached.

  if i try to concatenate these files into a single one the result is
not
  legal rdf/xml - there are xml doc headers:

  <rdf:RDF ... xml:base="http://www.gutenberg.org/">

  and similar, which can only occur once per file.

  i found a way to load each file individually with s-put and a loop,
but
  this runs extremely slowly - it is already running for more than 10
  hours; each file takes half a second to load (fuseki running as
localhost).

  i am sure there is a better way?

  thank you for the help!

  andrew

  --
  em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
  +43 1 58801 12710 direct
  Geoinformation, TU Wien +43 1 58801 12700 office
  Gusshausstr. 27-29 +43 1 55801 12799 fax
  1040 Wien Austria +43 676 419 25 72 mobil





--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil



Re: Change DESCRIBE @context

2017-10-07 Thread Andy Seaborne
Try to get it working locally - read and write a file to get it in the
format you want.


https://jena.apache.org/documentation/io/rdf-output.html#json-ld

  and the example:

https://github.com/apache/jena/blob/master/jena-arq/src-examples/arq/examples/riot/ExJsonLD.java

It is a long path through Fuseki so it's hard to see which code is doing
what.


Note: the json-ld writer is from a different project. Fuseki isn't doing 
anything for JSON-LD except passing it to the writer with the same 
information that is available with Turtle or any other syntax.


Andy
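
For local experimentation along these lines, a minimal sketch (assuming
Jena 3.x; the file name and prefix are illustrative) that sets a prefix
on the model and writes JSON-LD, so you can see what the jsonld-java
writer does with the same prefix information the Turtle writer gets:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class JsonLdLocal {
    public static void main(String[] args) {
        // Illustrative input file; any syntax riot can detect works here.
        Model model = RDFDataMgr.loadModel("data.ttl");
        model.setNsPrefix("ex", "http://example.com/vocab/");
        // The @context is generated by the jsonld-java writer from the
        // model's prefix map and the data.
        RDFDataMgr.write(System.out, model, RDFFormat.JSONLD);
    }
}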

On 07/10/17 16:24, Laura Morales wrote:

Not talking about proposing Fuseki a context/frame. I'm OK with an
auto-generated context. But Fuseki is creating a new prefix for every URI (see
previous email). Rather than creating 100s of new prefixes, it would be more
useful if Fuseki reused the same PREFIXes that were already specified with
my query. To see what I mean, send a DESCRIBE query and compare the Turtle and
JSON-LD outputs. If my query was

 PREFIX ex: <http://example.com/vocab/>
 DESCRIBE ...
 FROM ...

Turtle will return predicates such as

  "My name"


then the data has a URI  in it.



whereas JSON-LD will return

  "My name"

because it creates a new prefix (in the context) called

 "name": "http://example.com/vocab/name;

I'm just asking that the JSON-LD output use the user-defined PREFIXes, as
Turtle does.

Makes sense? Please let me know if I wasn't clear.




Sent: Saturday, October 07, 2017 at 4:49 PM
From: aj...@apache.org
To: users@jena.apache.org
Subject: Re: Change DESCRIBE @context
There's no way (as far as I know) right now to propose a particular context (or 
other profile information) via HTTP when
accepting JSON-LD:

https://github.com/json-ld/json-ld.org/issues/491

Or is your expectation that Jena would somehow figure out to do what you want 
unhinted? If you can define a very clear
and specific algorithm by which Jena could conservatively guess at the right 
way to build a @context, it might be
implementable.


ajs6f



Re: Change DESCRIBE @context

2017-10-07 Thread Laura Morales
Not talking about proposing Fuseki a context/frame. I'm OK with an
auto-generated context. But Fuseki is creating a new prefix for every URI (see
previous email). Rather than creating 100s of new prefixes, it would be more
useful if Fuseki reused the same PREFIXes that were already specified with
my query. To see what I mean, send a DESCRIBE query and compare the Turtle and
JSON-LD outputs. If my query was

PREFIX ex: <http://example.com/vocab/>
DESCRIBE ...
FROM ...

Turtle will return predicates such as

 "My name"

whereas JSON-LD will return

 "My name"

because it creates a new prefix (in the context) called

"name": "http://example.com/vocab/name;

I'm just asking that the JSON-LD output use the user-defined PREFIXes, as
Turtle does.

Makes sense? Please let me know if I wasn't clear.




Sent: Saturday, October 07, 2017 at 4:49 PM
From: aj...@apache.org
To: users@jena.apache.org
Subject: Re: Change DESCRIBE @context
There's no way (as far as I know) right now to propose a particular context (or 
other profile information) via HTTP when
accepting JSON-LD:

https://github.com/json-ld/json-ld.org/issues/491

Or is your expectation that Jena would somehow figure out to do what you want 
unhinted? If you can define a very clear
and specific algorithm by which Jena could conservatively guess at the right 
way to build a @context, it might be
implementable.


ajs6f


Re: loading many small rdf/xml files

2017-10-07 Thread Andy Seaborne
The continual round trip times are more than the time it takes Fuseki to 
perform an update.


On 07/10/17 15:42, aj...@apache.org wrote:

Couple of possibilities:

1) Get something other than RDF/XML from Gutenberg. I don't mean that to 
sound flippant. They may very well maintain some other representation 
(NTriples, Turtle, etc) for their own use and they might be willing to 
share it. It's worth an email. Then use SOH.


2A) Convert your stuff to a single NTriples (streamable) file and load 
it into a TDB database locally, then put it on the server. You can use 
riot to do this (it can accept more than one filename) but with that 
many files, you may need to do it in several stages or groups, or use 
xargs or the like. This may or may not work for you, depending on 
whether you have access to the server to install a TDB database directly 
into Fuseki, or only via HTTP.


2B) Convert your stuff to a single NTriples (streamable) file using riot 
and load it via SOH.


(or load it via the UI).

+1 to Adam's and Martynas's suggestion of preparing a single N-triples
file. Parse each file to N-triples with riot (slight bonus - call riot
with a number of files at the same time - for various OS reasons, you
can't give it all 50,000 at one time from the command line).


The added benefit here is that the data is checked before loading - even 
the best data does occasionally have errors in it and it is easier to 
notice that before uploading.


You can separately add prefixes by sending a Turtle file of prefixes 
with no triples.


Andy
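
A minimal sketch of producing such a prefixes-only file with the Jena API
(assuming Jena 3.x; the prefix is illustrative): an empty model with a
prefix set, written as Turtle, yields a file containing only @prefix
lines.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class PrefixesOnly {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();        // no triples at all
        m.setNsPrefix("pg", "http://www.gutenberg.org/");   // illustrative prefix
        RDFDataMgr.write(System.out, m, Lang.TURTLE);       // emits only @prefix lines
    }
}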


ajs6f

Andrew U. Frank wrote on 10/7/17 10:17 AM:
i have to load the Gutenberg projects catalog in rdf/xml format. this 
is a collection of about 50,000 files, each

containing a single record as attached.

if i try to concatenate these files into a single one the result is 
not legal rdf/xml - there are xml doc headers:


<rdf:RDF ... xml:base="http://www.gutenberg.org/">

and similar, which can only occur once per file.

i found a way to load each file individually with s-put and a loop, 
but this runs extremely slowly - it is already
running for more than 10 hours; each file takes half a second to load 
(fuseki running as localhost).


i am sure there is a better way?

thank you for the help!

andrew





Re: loading many small rdf/xml files

2017-10-07 Thread zPlus
Hello Andrew,

if I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will not work indeed. My solution was
to convert all XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
file.

I hope this helps.


[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54
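
The same convert-then-concatenate idea can also be done with Jena's RIOT
API instead of the riot command line. A minimal sketch, assuming Jena 3.x
and illustrative paths (a directory of .rdf files, one N-Triples output);
each file is parsed separately, so bad files fail individually, and
N-Triples output concatenates cleanly:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class BatchConvert {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("data.nt");
             DirectoryStream<Path> dir =
                 Files.newDirectoryStream(Paths.get("catalog"), "*.rdf")) {
            for (Path file : dir) {
                // Parse one RDF/XML file; parse errors surface here,
                // before anything reaches the server.
                Model m = RDFDataMgr.loadModel(file.toString(), Lang.RDFXML);
                // Append its triples to the single N-Triples stream.
                RDFDataMgr.write(out, m, Lang.NTRIPLES);
            }
        }
    }
}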

- Original Message -
From: users@jena.apache.org
To:"users@jena.apache.org" 
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

 i have to load the Gutenberg projects catalog in rdf/xml format. this
is 
 a collection of about 50,000 files, each containing a single record
as 
 attached.

 if i try to concatenate these files into a single one the result is
not 
 legal rdf/xml - there are xml doc headers:

 <rdf:RDF ... xml:base="http://www.gutenberg.org/">

 and similar, which can only occur once per file.

 i found a way to load each file individually with s-put and a loop,
but 
 this runs extremely slowly - it is already running for more than 10 
 hours; each file takes half a second to load (fuseki running as
localhost).

 i am sure there is a better way?

 thank you for the help!

 andrew

 -- 
 em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
 Geoinformation, TU Wien +43 1 58801 12700 office
 Gusshausstr. 27-29 +43 1 55801 12799 fax
 1040 Wien Austria +43 676 419 25 72 mobil




Re: Change DESCRIBE @context

2017-10-07 Thread ajs6f
There's no way (as far as I know) right now to propose a particular context (or other profile information) via HTTP when 
accepting JSON-LD:


https://github.com/json-ld/json-ld.org/issues/491

Or is your expectation that Jena would somehow figure out to do what you want unhinted? If you can define a very clear 
and specific algorithm by which Jena could conservatively guess at the right way to build a @context, it might be 
implementable.



ajs6f

Laura Morales wrote on 10/7/17 10:21 AM:

The problem is that Fuseki (when I select JSON-LD output format) creates a 
@context with as many properties as there are URIs. For example


  "@context" : {
"name" : {
  "@id" : "example.org/vocab/name",
  "@type" : ...
},
"surname" : {
  "@id" : "example.org/vocab/surname",
  "@type" : ...
},
"age" : {
  "@id" : "example.org/vocab/age",
  "@type" : ...
},
"ex": "example.org/vocab/"
  }


whereas all I want is


  "@context" : {
"ex": "example.org/vocab/"
  }




Sent: Saturday, October 07, 2017 at 12:37 PM
From: "Andy Seaborne" 
To: users@jena.apache.org
Subject: Re: Change DESCRIBE @context
The result Model from DESCRIBE has the prefixes of the data and the
query. There can be multiple prefixes for the same URI.

How that gets processed by JSON-LD is another matter and I don't know
the details.

Andy



Re: loading many small rdf/xml files

2017-10-07 Thread ajs6f

Couple of possibilities:

1) Get something other than RDF/XML from Gutenberg. I don't mean that to sound flippant. They may very well maintain 
some other representation (NTriples, Turtle, etc) for their own use and they might be willing to share it. It's worth an 
email. Then use SOH.


2A) Convert your stuff to a single NTriples (streamable) file and load it into a TDB database locally, then put it on 
the server. You can use riot to do this (it can accept more than one filename) but with that many files, you may need to 
do it in several stages or groups, or use xargs or the like. This may or may not work for you, depending on whether you 
have access to the server to install a TDB database directly into Fuseki, or only via HTTP.


2B) Convert your stuff to a single NTriples (streamable) file using riot and 
load it via SOH.

ajs6f
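
For option 2A, a minimal sketch of the local TDB load (assuming Jena 3.x's
TDB1 API; the directory and file names are illustrative). The resulting
database directory can then be pointed at from a Fuseki configuration; for
data this size the tdbloader command-line tool is typically faster, but
the API route is enough to show the idea:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class LocalTdbLoad {
    public static void main(String[] args) {
        Dataset ds = TDBFactory.createDataset("tdb-dir");  // illustrative location
        ds.begin(ReadWrite.WRITE);
        try {
            // Read the concatenated N-Triples file into the default graph.
            RDFDataMgr.read(ds.getDefaultModel(), "data.nt");
            ds.commit();
        } finally {
            ds.end();
        }
    }
}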

Andrew U. Frank wrote on 10/7/17 10:17 AM:

i have to load the Gutenberg projects catalog in rdf/xml format. this is a 
collection of about 50,000 files, each
containing a single record as attached.

if i try to concatenate these files into a single one the result is not legal 
rdf/xml - there are xml doc headers:

<rdf:RDF ... xml:base="http://www.gutenberg.org/">

and similar, which can only occur once per file.

i found a way to load each file individually with s-put and a loop, but this 
runs extremely slowly - it is already
running for more than 10 hours; each file takes half a second to load (fuseki 
running as localhost).

i am sure there is a better way?

thank you for the help!

andrew





Re: loading many small rdf/xml files

2017-10-07 Thread Martynas Jusevičius
Run a script to convert them to N-Triples and then another to concatenate
the files?

On Sat, Oct 7, 2017 at 4:17 PM, Andrew U. Frank 
wrote:

> i have to load the Gutenberg projects catalog in rdf/xml format. this is a
> collection of about 50,000 files, each containing a single record as
> attached.
>
> if i try to concatenate these files into a single one the result is not
> legal rdf/xml - there are xml doc headers:
>
> <rdf:RDF ... xml:base="http://www.gutenberg.org/">
>
> and similar, which can only occur once per file.
>
> i found a way to load each file individually with s-put and a loop, but
> this runs extremely slowly - it is already running for more than 10 hours;
> each file takes half a second to load (fuseki running as localhost).
>
> i am sure there is a better way?
>
> thank you for the help!
>
> andrew
>
>
>
> --
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>  +43 1 58801 12710 direct
> Geoinformation, TU Wien  +43 1 58801 12700 office
> Gusshausstr. 27-29   +43 1 55801 12799 fax
> 1040 Wien Austria+43 676 419 25 72 mobil
>
>


Re: Change DESCRIBE @context

2017-10-07 Thread Laura Morales
The problem is that Fuseki (when I select JSON-LD output format) creates a 
@context with as many properties as there are URIs. For example


  "@context" : {
"name" : {
  "@id" : "example.org/vocab/name",
  "@type" : ...
},
"surname" : {
  "@id" : "example.org/vocab/surname",
  "@type" : ...
},
"age" : {
  "@id" : "example.org/vocab/age",
  "@type" : ...
},
"ex": "example.org/vocab/"
  }


whereas all I want is


  "@context" : {
"ex": "example.org/vocab/"
  }




Sent: Saturday, October 07, 2017 at 12:37 PM
From: "Andy Seaborne" 
To: users@jena.apache.org
Subject: Re: Change DESCRIBE @context
The result Model from DESCRIBE has the prefixes of the data and the
query. There can be multiple prefixes for the same URI.

How that gets processed by JSON-LD is another matter and I don't know
the details.

Andy


loading many small rdf/xml files

2017-10-07 Thread Andrew U. Frank
i have to load the Gutenberg projects catalog in rdf/xml format. this is 
a collection of about 50,000 files, each containing a single record as 
attached.


if i try to concatenate these files into a single one the result is not 
legal rdf/xml - there are xml doc headers:


<rdf:RDF ... xml:base="http://www.gutenberg.org/">

and similar, which can only occur once per file.

i found a way to load each file individually with s-put and a loop, but 
this runs extremely slowly - it is already running for more than 10 
hours; each file takes half a second to load (fuseki running as localhost).


i am sure there is a better way?

thank you for the help!

andrew



--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
 +43 1 58801 12710 direct
Geoinformation, TU Wien  +43 1 58801 12700 office
Gusshausstr. 27-29   +43 1 55801 12799 fax
1040 Wien Austria+43 676 419 25 72 mobil



pg9630.rdf
Description: application/rdf


Re: Change DESCRIBE @context

2017-10-07 Thread Andy Seaborne
The result Model from DESCRIBE has the prefixes of the data and the 
query.  There can be multiple prefixes for the same URI.


How that gets processed by JSON-LD is another matter and I don't know 
the details.


Andy

On 07/10/17 10:27, Laura Morales wrote:

When I query Fuseki like this "DESCRIBE <> FROM <>" and the result is returned as JSON-LD, it looks like the 
"@context" is generated automatically and it ignores any "PREFIX" declaration.
For example if I have "PREFIX foo: " it will create the property 
"Person" instead of "foo:Person".

Is there any way that I can tweak my DESCRIBE query to return properties as "foo:Person" 
instead of the automatically generated "Person"?



Change DESCRIBE @context

2017-10-07 Thread Laura Morales
When I query Fuseki like this "DESCRIBE <> FROM <>" and the result is returned as 
JSON-LD, it looks like the "@context" is generated automatically and it ignores 
any "PREFIX" declaration.
For example if I have "PREFIX foo: " it will 
create the property "Person" instead of "foo:Person".

Is there any way that I can tweak my DESCRIBE query to return properties as 
"foo:Person" instead of the automatically generated "Person"?