Re: [Wikidata] Wikidata Digest, Vol 80, Issue 24

2018-07-28 Thread haimz
ly if they work
> for "greater units", that is, I search for a year and get an item back,
> even though the statement is month- or day-precise

This is something I've been thinking about for a while, mainly because
the way we index dates now does not serve some important use cases. Even
in the Query Service we treat dates as fixed instants on the time scale,
whereas some dates are not instants but intervals (which is captured in
Wikidata's precision field, though we currently pay no attention to it);
in fact, many of the dates we use are more interval-like than
instant-like.

This makes searching for "somebody who was born in 1820" possible but
laborious (you need to construct the intervals manually) and inefficient,
since we can't simply look up by year.
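To illustrate the interval-vs-instant point, here is a hypothetical sketch
(not the actual indexing code) of how a timestamp plus a Wikidata precision
code could be expanded into the half-open interval it really denotes, so
that a year-level query like "born in 1820" becomes an interval test:

```python
from datetime import datetime

# Wikidata precision codes (subset): 9 = year, 10 = month, 11 = day.
def to_interval(value: str, precision: int):
    """Expand a timestamp into the half-open interval it actually denotes."""
    dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    if precision == 9:    # year-precise: covers the whole year
        return (datetime(dt.year, 1, 1), datetime(dt.year + 1, 1, 1))
    if precision == 10:   # month-precise: covers the whole month
        nxt = datetime(dt.year + (dt.month == 12), dt.month % 12 + 1, 1)
        return (datetime(dt.year, dt.month, 1), nxt)
    return (dt, dt)       # day or finer: treat as an instant here

def overlaps_year(value: str, precision: int, year: int) -> bool:
    """True if the date's interval intersects the given calendar year."""
    start, end = to_interval(value, precision)
    return start < datetime(year + 1, 1, 1) and end >= datetime(year, 1, 1)

# A day-precise birth date still matches a year-level query:
print(overlaps_year("1820-03-15T00:00:00Z", 11, 1820))  # True
```

With intervals indexed this way, a year lookup would be a single range
query instead of manually constructed bounds in every SPARQL query.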

There are certainly improvements possible in this area; I'm not yet sure
how to do it, though.

-- 
Stas Malyshev
smalys...@wikimedia.org



--

Message: 12
Date: Sat, 28 Jul 2018 12:42:52 +0200
From: David Causse 
To: Stas Malyshev 
Cc: Internal communications for WMF search and discovery team
, wikidata@lists.wikimedia.org
Subject: Re: [Wikidata] [discovery-private] Indexing all item
properties in   ElasticSearch
Message-ID:

Content-Type: text/plain; charset="utf-8"

On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev 
wrote:

> Hi!
>
> > The top 1000
> > is:
> https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing
>
> This one is pretty interesting, how do I extract this data? It may be
> useful independently of what we're discussing here.
>

This can be extracted from Elasticsearch using aggregations. To obtain a
top 1000 of the terms that do not match P31= or P279=, you can run:

curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0&pretty' -d
'{"aggs": {"item_usage": {"terms": {"field": "statement_keywords",
"exclude": "P(31|279)=.*", "size": 1000}}}}' > top1k.json

To obtain an approximation of the cardinality (unique terms) of a field:

curl -XPOST localhost:9200/wikidatawiki_content/_search?size=0 -d '{"aggs":
{"item_usage": {"cardinality": {"field": "statement_keywords"}}}}'

Note that I used the spare cluster to run these.
As for property usage, I just realized that we have the outgoing_link
field, which contains an array like:
"outgoing_link": ["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18"
,"Property:P1889","Property:P248","Property:P2612","Property:P279","
Property:P3221","Property:P3417","Property:P373","Property:P3827","
Property:P577","Property:P646","Property:P910"]
We don't have doc values enabled for this one, so we can't run
aggregations, but if the list of terms is known it could easily be
extracted by running X count queries, where X is the number of possible
properties.
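The X-count-queries workaround could be sketched as follows. This is a
hypothetical illustration, not code from the message: it only builds the
request bodies for the standard Elasticsearch _count API, using the
outgoing_link field name mentioned above and an illustrative property list.

```python
import json

def count_query(prop: str) -> str:
    """Build the JSON body of an ES _count request for one outgoing_link term."""
    return json.dumps({"query": {"term": {"outgoing_link": f"Property:{prop}"}}})

# One _count request per known property; running X of them approximates usage.
known_properties = ["P18", "P279", "P577"]  # illustrative subset
for prop in known_properties:
    body = count_query(prop)
    # e.g.: curl -XPOST localhost:9200/wikidatawiki_content/_count -d "$body"
    print(prop, body)
```

Since exact term queries don't need doc values, this works even though
aggregations on the field do not.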

--

Message: 13
Date: Sat, 28 Jul 2018 13:13:29 +0200
From: Ettore RIZZA 
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] Wikidata in the LOD Cloud
Message-ID:

Content-Type: text/plain; charset="utf-8"

Dear all,

Stop me if my question is naive or stupid, but I see that a dataset like
Europeana is both in the LOD Cloud and a property in Wikidata
<https://www.wikidata.org/wiki/Property:P727>. However, the method using
the "Formatter URL for RDF resource" property does not work, because this
property is missing from Europeana ID. How many other cases like this are there?

But I see in this simplified version of the LOD Cloud
<https://jqplay.org/s/bgiJvPKryC> that each dataset has a namespace.
Wouldn't it be more efficient to match Wikidata and the LOD Cloud using
these namespaces in a series of SPARQL queries <http://tinyurl.com/y8taazzm>?
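One hypothetical way to sketch that namespace matching (not from the
message, and untested against the live endpoint): generate a single SPARQL
query that compares LOD Cloud namespaces against the string values of the
"formatter URI for RDF resource" property (P1921). The namespaces below
are illustrative only.

```python
def build_query(namespaces):
    """Build a SPARQL query matching P1921 formatter URIs against namespaces."""
    values = "\n    ".join(f'"{ns}"' for ns in namespaces)
    return f"""SELECT ?prop ?fmt WHERE {{
  ?prop wdt:P1921 ?fmt .
  VALUES ?ns {{
    {values}
  }}
  FILTER(STRSTARTS(STR(?fmt), ?ns))
}}"""

query = build_query(["http://data.europeana.eu/", "http://dbpedia.org/resource/"])
print(query)
```

Each LOD Cloud dataset whose namespace prefixes some property's formatter
URI would then come back as a candidate match.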

Cheers,

Ettore

On Mon, 9 Jul 2018 at 14:07, Lucas Werkmeister 
wrote:

> On 27.06.2018 22:40, Federico Leva (Nemo) wrote:
> > Maarten Dammers, 27/06/2018 23:26:
> >> Excellent news! https://lod-cloud.net/dataset/wikidata seems to
> >> contain the info in a more human readable (and machine readable) way.
> >> If we add some URI link, does it automagically appear or does Lucas
> >> has to do some manual work? I assume Lucas has to do some manual work.
> >
> > I'd also be curious what to do when a property does not have a node in
> > the LOD cloud, for instance P2948 is among the 77 results for P1921
> > but I don't see any corresponding URL in
> > http://lod-cloud.net/versions/2018-30-05/lod-data.json
>
> Previously it was manual work, yes, and for properties not in the LOD
> cloud I added commented-out entries to the page source of
> https://www.wikidata.org/wiki/User:Lucas_Werkmeister_(WMDE)/LOD_Cloud.
> I’ll try to resubmit Wikidata now and see how the submission process has
> evolved.
>
> Cheers, Lucas
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

--

Subject: Digest Footer

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--

End of Wikidata Digest, Vol 80, Issue 24





Re: [Wikidata] Indexing all item properties in ElasticSearch

2018-07-28 Thread Lydia Pintscher
Thanks a lot for looking into this, Stas!

On Thu, Jul 26, 2018 at 11:49 PM Stas Malyshev  wrote:
> So, we have two questions:
> 1. Do we want to enable indexing for all item properties? Note that if
> you just want to find items with certain statement values, Wikidata
> Query Service matches this use case best. It's only in combination with
> actual fulltext search that on-wiki search is better.

I would say yes.

> 2. Do we need to index P2860 and P1433 at all, and if so, would it be ok
> if we omit indexing for now?

Yes, it should be perfectly fine to go without these for now, and maybe
permanently. They're mostly (perhaps only?) used on the large corpus of
scientific papers.


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.


