Re: [Wikidata] ntriples dump?

2016-08-27 Thread Stas Malyshev
Hi!

Looks like the feedback to the idea has been positive (thanks to
everybody that participated!) so I've made a task to track it:

https://phabricator.wikimedia.org/T144103

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-27 Thread Stas Malyshev
Hi!

> out of curiosity, can you give an example of triples that do not
> originate from a single wikidata item / property?

All references and values can be shared between items. E.g. if two items
refer to the same date, they will refer to the same value node. The same
holds if they have references with identical properties - e.g. a reference
URL pointing to the same address.
These nodes do not have their own documents - since in Wikibase and
Wikidata it's not possible to address individual values/references - yet
they are not tied to a single entity either.
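
As an illustration (the hashes and statement IDs below are invented; the
wdref:/prov: terms follow the Wikidata RDF mapping, with pr:P854 as the
reference URL property), two statements on different items can share one
reference node:

    wds:Q100-stmt1 prov:wasDerivedFrom wdref:abc123 .
    wds:Q200-stmt7 prov:wasDerivedFrom wdref:abc123 .
    wdref:abc123 pr:P854 <http://example.org/some-source> .

The wdref:abc123 node shows up in the exports of both Q100 and Q200, so
it cannot be attributed to a single entity's document.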

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-27 Thread Neil Harris

On 27/08/16 10:56, Markus Kroetzsch wrote:

On 26.08.2016 22:32, Aidan Hogan wrote:
...


tl;dr:
N-Triples or N-Triples + Turtle sounds good.
N-Quads would be a bonus if easy to do.


+1 to all of this

Best,

Markus


Also, if we are adding new dump formats, it might also be worth
considering better compression, particularly for fully expanded
formats like N-Triples. Would, for example, .7z compression give
significantly better results than .bz2 on this data?
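
A quick way to get a rough answer is to compare the codecs on a sample of
the dump. A small sketch in Python, with xz/LZMA standing in for 7z's
default codec (the sample file name is made up, and ratios on a small
sample will only approximate the full dump):

    import bz2
    import lzma

    # Read a sample of the N-Triples dump (file name is hypothetical).
    with open("wikidata-sample.nt", "rb") as f:
        sample = f.read()

    # Compare compressed sizes for bz2 vs. xz/LZMA at default settings.
    for name, compress in (("bz2", bz2.compress), ("xz", lzma.compress)):
        size = len(compress(sample))
        print(f"{name}: {size} bytes ({size / len(sample):.1%} of original)")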


Neil


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-27 Thread Dimitris Kontokostas
Hi Stas,

out of curiosity, can you give an example of triples that do not originate
from a single wikidata item / property?

For me, Turtle dumps are processable only by RDF tools, while NT-like dumps
can be processed both by RDF tools and by other kinds of scripts, so I find
the former redundant.

On Fri, Aug 26, 2016 at 11:52 PM, Stas Malyshev 
wrote:

> Hi!
>
> > Of course if providing both is easy, then there's no reason not to
> > provide both.
>
> Technically it's quite easy - you just run the same script with
> different options. So the only question is what is useful.
>
> > It is useful in such applications to know the online RDF documents in
> > which a triple can be found. The document could be the entity, or it
> > could be a physical location like:
> >
> > http://www.wikidata.org/entity/Q13794921.ttl
>
> That's where the tricky part is: many triples won't have a specific
> document there, since they may appear in many documents. Of course, if
> you merge all these documents into a dump, the triple would appear only
> once (we have special deduplication code to take care of that), but then
> it's impossible to track it back to a specific document. So I understand
> the idea and see how it may be useful, but I don't see a practical way
> to implement it right now.
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>



-- 
Kontokostas Dimitris
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-27 Thread Fariz Darari
Hello Stas,

+1 for an .nt RDF dump of WD, due to (as you also said) easier processing!

Regards,
Fariz

On Fri, Aug 26, 2016 at 10:52 PM, Stas Malyshev 
wrote:

> Hi!
>
> > Of course if providing both is easy, then there's no reason not to
> > provide both.
>
> Technically it's quite easy - you just run the same script with
> different options. So the only question is what is useful.
>
> > It is useful in such applications to know the online RDF documents in
> > which a triple can be found. The document could be the entity, or it
> > could be a physical location like:
> >
> > http://www.wikidata.org/entity/Q13794921.ttl
>
> That's where the tricky part is: many triples won't have a specific
> document there, since they may appear in many documents. Of course, if
> you merge all these documents into a dump, the triple would appear only
> once (we have special deduplication code to take care of that), but then
> it's impossible to track it back to a specific document. So I understand
> the idea and see how it may be useful, but I don't see a practical way
> to implement it right now.
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-26 Thread Stas Malyshev
Hi!

> Of course if providing both is easy, then there's no reason not to
> provide both.

Technically it's quite easy - you just run the same script with
different options. So the only question is what is useful.

> It is useful in such applications to know the online RDF documents in
> which a triple can be found. The document could be the entity, or it
> could be a physical location like:
> 
> http://www.wikidata.org/entity/Q13794921.ttl

That's where the tricky part is: many triples won't have a specific
document there, since they may appear in many documents. Of course, if
you merge all these documents into a dump, the triple would appear only
once (we have special deduplication code to take care of that), but then
it's impossible to track it back to a specific document. So I understand
the idea and see how it may be useful, but I don't see a practical way
to implement it right now.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-26 Thread Stas Malyshev
Hi!

> I think in terms of the dump, /replacing/ the Turtle dump with the
> N-Triples dump would be a good option. (Not sure if that's what you were
> suggesting?)

No, I'm suggesting having both. Turtle is easier to comprehend and also
more compact for download, etc. (though I haven't checked how big the
difference is - compressed, it may not be that big).

> to have both: existing tools expecting Turtle shouldn't have a problem
> with N-Triples.

That depends on whether these tools actually understand RDF - some might
be more simplistic (with text-based formats, you can achieve a lot even
with dumber tools). But that is definitely an option too. I'm not sure
it's the best one, but it's a possibility, so we can discuss it as well.

> (Also just to put the idea out there of perhaps (also) having N-Quads
> where the fourth element indicates the document from which the RDF graph
> can be dereferenced. This can be useful for a tool that, e.g., just

What do you mean by "document" - the entity? That may be a problem, since
some data - like references and values, or property definitions - can be
used by more than one entity. So it's not that trivial to extract all the
data regarding one entity from the dump. You can do it via the export, e.g.:
http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't
extract it from the dump, it just generates it.
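
For what it's worth, a minimal sketch of using that export from Python
(the Accept header and the assumption that the redirect preserves the
flavor parameter are mine, not something I've verified):

    import urllib.request

    # Fetch the generated RDF for a single entity; the entity URI
    # redirects to the actual export document.
    url = "http://www.wikidata.org/entity/Q42?flavor=full"
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8")[:500])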

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-26 Thread Aidan Hogan

Hi Stas,

I think in terms of the dump, /replacing/ the Turtle dump with the 
N-Triples dump would be a good option. (Not sure if that's what you were 
suggesting?)


As you already mentioned, N-Triples is easier to process with typical 
unix command-line tools and scripts, etc. But also any (RDF 1.1) 
N-Triples file should be valid Turtle, so I don't see a convincing need 
to have both: existing tools expecting Turtle shouldn't have a problem 
with N-Triples.


(Also just to put the idea out there of perhaps (also) having N-Quads 
where the fourth element indicates the document from which the RDF graph 
can be dereferenced. This can be useful for a tool that, e.g., just 
wants to quickly refresh a single graph from the dump, or more generally 
that wants to keep track of a simple and quick notion of provenance: 
"this triple was found in this Web document".)


Cheers,
Aidan

On 26-08-2016 16:30, Stas Malyshev wrote:

Hi!

I was thinking recently about various data processing scenarios in
Wikidata, and there's one case I think we don't have good coverage for.

TL;DR: One of the things we might do to make it easier to work with the
data is to make an N-Triples (line-based) RDF dump available.

If you need to process a lot of data (like all enwiki sitelinks, etc.)
then the Query Service is not very efficient, due to limits and the
sheer volume of data. We could increase the limits, but not by much - I
don't think we can allow a 30-minute processing task to hog the service's
resources for itself. We have some ways to mitigate this, in theory,
but in practice they'll take time to be implemented and deployed.

The other approach would be dump processing, which would work in most
scenarios. The problem is that we have two forms of dump right now -
JSON and TTL (Turtle) - and neither is easy to process without tools
that deeply understand the format. For JSON, we have Wikidata Toolkit,
but it can't ingest RDF/Turtle, and it also has some entry barrier to
get everything running, even when the operation that needs to be done
is trivial.

So I was thinking - what if we also had an N-Triples RDF dump? The
difference between N-Triples and Turtle is that N-Triples is line-based
and fully expanded - which means every line can be understood on its own,
without any context. This makes it possible to process the dump with the
most basic text processing tools, or with any software that can read a
line of text and apply a regexp to it. The downside of N-Triples is that
it's really verbose, but compression takes care of most of that, and
storing another 10-15G or so should not be a huge deal. Also, the current
code already knows how to generate an N-Triples dump (in fact, almost all
unit tests internally use this format) - we just need to create a job
that actually generates it.

Of course, with the right tools you can generate an N-Triples dump from
either the Turtle dump or the JSON dump (Wikidata Toolkit can do the
latter, IIRC), but that's one more moving part, which makes things
harder and introduces potential for inconsistencies and surprises.

So, what do you think - would having an N-Triples RDF dump for Wikidata
help things?



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] ntriples dump?

2016-08-26 Thread Stas Malyshev
Hi!

I was thinking recently about various data processing scenarios in
Wikidata, and there's one case I think we don't have good coverage for.

TL;DR: One of the things we might do to make it easier to work with the
data is to make an N-Triples (line-based) RDF dump available.

If you need to process a lot of data (like all enwiki sitelinks, etc.)
then the Query Service is not very efficient, due to limits and the
sheer volume of data. We could increase the limits, but not by much - I
don't think we can allow a 30-minute processing task to hog the service's
resources for itself. We have some ways to mitigate this, in theory,
but in practice they'll take time to be implemented and deployed.

The other approach would be dump processing, which would work in most
scenarios. The problem is that we have two forms of dump right now -
JSON and TTL (Turtle) - and neither is easy to process without tools
that deeply understand the format. For JSON, we have Wikidata Toolkit,
but it can't ingest RDF/Turtle, and it also has some entry barrier to
get everything running, even when the operation that needs to be done
is trivial.

So I was thinking - what if we also had an N-Triples RDF dump? The
difference between N-Triples and Turtle is that N-Triples is line-based
and fully expanded - which means every line can be understood on its own,
without any context. This makes it possible to process the dump with the
most basic text processing tools, or with any software that can read a
line of text and apply a regexp to it. The downside of N-Triples is that
it's really verbose, but compression takes care of most of that, and
storing another 10-15G or so should not be a huge deal. Also, the current
code already knows how to generate an N-Triples dump (in fact, almost all
unit tests internally use this format) - we just need to create a job
that actually generates it.
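
For example, extracting all enwiki sitelinks from a (hypothetical) gzipped
N-Triples dump could look roughly like this - the file name is made up,
and the schema:about pattern reflects how sitelinks appear in the RDF
output, as far as I recall:

    import gzip
    import re

    # Match lines like:
    # <https://en.wikipedia.org/wiki/Foo> <http://schema.org/about> <http://www.wikidata.org/entity/Q123> .
    pattern = re.compile(
        r'^<(https://en\.wikipedia\.org/wiki/[^>]+)> '
        r'<http://schema\.org/about> '
        r'<(http://www\.wikidata\.org/entity/Q\d+)> \.'
    )

    with gzip.open("wikidata-dump.nt.gz", "rt", encoding="utf-8") as dump:
        for line in dump:
            m = pattern.match(line)
            if m:
                print(m.group(2), m.group(1))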

Of course, with the right tools you can generate an N-Triples dump from
either the Turtle dump or the JSON dump (Wikidata Toolkit can do the
latter, IIRC), but that's one more moving part, which makes things
harder and introduces potential for inconsistencies and surprises.
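
A minimal sketch of that conversion route with rdflib, for small extracts
only (it parses everything into memory, so it is not realistic for the
full dump; streaming converters would be needed there):

    from rdflib import Graph

    # Parse a (small) Turtle file and re-serialize it as N-Triples.
    g = Graph()
    g.parse("extract.ttl", format="turtle")
    g.serialize(destination="extract.nt", format="nt")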

So, what do you think - would having an N-Triples RDF dump for Wikidata
help things?
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata