Re: [Wikidata] WDQS status

2020-07-09 Thread Jan Macura
Dear Guillaume,

On Thu, 9 Jul 2020 at 15:23, Guillaume Lederrey 
wrote:

> We've been hard at work on Wikimedia Commons Query Service (WCQS) [2].
> This will be a SPARQL endpoint similar to WDQS, but serving the Structured
> Data on Commons dataset. Our goal is to open a beta service, hosted on
> Wikimedia Cloud Services (WMCS) by the end of July. The service will require
> an account on Commons for authentication and will allow federation with
> WDQS. We don't have a streaming update process ready yet, the data will be
> reloaded from Commons dumps weekly for a start.
>

I haven't seen this coming. Applause for you and your team!

Jan
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS status

2020-07-09 Thread Guillaume Lederrey
On Thu, Jul 9, 2020 at 4:52 PM Egon Willighagen 
wrote:

>
> Dear Guillaume,
>
> On Thu, Jul 9, 2020 at 3:23 PM Guillaume Lederrey 
> wrote:
>
>> Some very preliminary analysis indicates that less than 2% of the queries
>> on WDQS generate more than 90% of the load. This is definitely something we
>> need to better understand.
>>
>
> Is the data behind that available? I wonder if I recognize any of the top
> 25 queries.
>

No, the data isn't publicly available. Queries can (and do) contain private
information, so we don't publish raw queries. We might publish a subset of
those queries at some point, but only after having reviewed them manually
to ensure they are clean.

> (I guess the top 2% can be simple queries run very many times, as well as
> hard queries rarely run, correct?)
>

The analysis at this point is just on individual queries, with no
aggregation of similar queries. This means that the 2% in question are
individually expensive queries, not cheap queries run many times. We need to
refine that analysis, and aggregating similar queries is one of the things we
should be working on.
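
As a rough illustration of what aggregating similar queries could look like,
here is a minimal Python sketch. It is not the team's actual analysis: the
normalization rules, the function names (`normalize`, `aggregate`,
`share_of_templates_for_load`), and the (query, cost) record format are all
assumptions made for the example. It collapses string literals, entity IDs,
and numbers into placeholders, sums a per-query cost by template, and reports
what share of templates accounts for a given fraction of the total load.

```python
import re
from collections import defaultdict


def normalize(query: str) -> str:
    """Reduce a SPARQL query to a rough template (hypothetical rules)."""
    # Replace quoted string literals with a placeholder.
    q = re.sub(r'"(?:[^"\\]|\\.)*"', '"S"', query)
    # Replace item IDs such as Q42 with a placeholder.
    q = re.sub(r"\bQ\d+\b", "Q?", q)
    # Replace standalone numbers (e.g. LIMIT values) with 0.
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "0", q)
    # Collapse runs of whitespace.
    return " ".join(q.split())


def aggregate(records):
    """Sum the cost of each (query, cost) record by query template."""
    totals = defaultdict(float)
    for query, cost in records:
        totals[normalize(query)] += cost
    return dict(totals)


def share_of_templates_for_load(totals, load_fraction=0.9):
    """Fraction of templates needed to reach load_fraction of total cost."""
    costs = sorted(totals.values(), reverse=True)
    target = load_fraction * sum(costs)
    running = 0.0
    for count, cost in enumerate(costs, start=1):
        running += cost
        if running >= target:
            return count / len(costs)
    return 1.0


# Made-up records: two cheap queries sharing a template, one expensive query.
records = [
    ("SELECT ?x WHERE { ?x wdt:P31 wd:Q5 } LIMIT 10", 1.0),
    ("SELECT ?x WHERE { ?x wdt:P31 wd:Q146 }  LIMIT 500", 2.0),
    ('SELECT ?y WHERE { ?y rdfs:label "Douglas Adams"@en }', 97.0),
]
totals = aggregate(records)
print(len(totals), share_of_templates_for_load(totals))
```

With aggregation in place, the same calculation would distinguish a template
that is expensive per run from one that is cheap but run very often, which is
the distinction raised in the question above.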


> Egon
>
>
> --
> Hi, do you like citation networks? Already 51% of all citations are
> available for innovative new uses. Join me in asking the American
> Chemical Society to join the Initiative for Open Citations too.
> SpringerNature, the RSC and many others already did.
>
> -
> E.L. Willighagen
> Department of Bioinformatics - BiGCaT
> Maastricht University (http://www.bigcat.unimaas.nl/)
> Homepage: http://egonw.github.com/
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: https://www.zotero.org/egonw
> ORCID: -0001-7542-0286 
> ImpactStory: https://impactstory.org/u/egonwillighagen
>


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


Re: [Wikidata] WDQS status

2020-07-09 Thread Egon Willighagen
Dear Guillaume,

On Thu, Jul 9, 2020 at 3:23 PM Guillaume Lederrey 
wrote:

> Some very preliminary analysis indicates that less than 2% of the queries
> on WDQS generate more than 90% of the load. This is definitely something we
> need to better understand.
>

Is the data behind that available? I wonder if I recognize any of the top
25 queries.

(I guess the top 2% can be simple queries run very many times, as well as
hard queries rarely run, correct?)

Egon


-- 
Hi, do you like citation networks? Already 51% of all citations are
available for innovative new uses. Join me in asking the American
Chemical Society to join the Initiative for Open Citations too.
SpringerNature, the RSC and many others already did.

-
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
Blog: http://chem-bla-ics.blogspot.com/
PubList: https://www.zotero.org/egonw
ORCID: -0001-7542-0286 
ImpactStory: https://impactstory.org/u/egonwillighagen


Re: [Wikidata] WDQS status

2020-07-09 Thread Guillaume Lederrey
On Thu, Jul 9, 2020 at 3:35 PM Gerard Meijssen 
wrote:

> Hoi,
> Is this different from Special:MediaSearch ??
>

I'm assuming that you are asking if the new WCQS is different from the
Special:MediaSearch prototype [1].

And yes, it is quite different. WCQS is a low level SPARQL interface,
oriented toward power users and tools, allowing federation with WDQS and
the Wikidata dataset. Special:MediaSearch is a higher level search
interface, backed by elasticsearch. It is using the same underlying data,
but in a very different way.
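
To make the federation point concrete, here is a hedged Python sketch of the
kind of query a tool might send to WCQS. The helper name
`build_federated_query` is hypothetical, and the sketch assumes that
Structured Data on Commons exposes "depicts" statements under the usual
Wikidata direct-property IRI for P180, which may not match the beta service.
It only builds the query text; it does not contact any endpoint or handle the
Commons authentication the service will require. Full IRIs are used instead
of prefixed names so the query does not depend on any prefix declaration.

```python
import re

# Public WDQS SPARQL endpoint, used as the federation target.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Standard Wikidata IRI namespaces.
DIRECT_PROP = "http://www.wikidata.org/prop/direct/"
ENTITY = "http://www.wikidata.org/entity/"


def build_federated_query(class_id: str, limit: int = 10) -> str:
    """Build a SPARQL query: files depicting instances of a Wikidata class.

    The outer pattern is meant to match Commons data (assumption: "depicts"
    is P180 under the direct-property namespace); the SERVICE block is
    evaluated against WDQS.
    """
    # Reject anything that is not a plain item ID, to avoid query injection.
    if not re.fullmatch(r"Q[1-9][0-9]*", class_id):
        raise ValueError(f"not a Wikidata item ID: {class_id!r}")
    limit = int(limit)
    return (
        "SELECT ?file ?item WHERE {\n"
        f"  ?file <{DIRECT_PROP}P180> ?item .\n"
        f"  SERVICE <{WDQS_ENDPOINT}> {{\n"
        f"    ?item <{DIRECT_PROP}P31> <{ENTITY}{class_id}> .\n"
        "  }\n"
        f"}} LIMIT {limit}"
    )


# Q146 is "house cat" on Wikidata.
print(build_federated_query("Q146", limit=5))
```

Special:MediaSearch, by contrast, would answer a keyword search without the
user writing any query at all.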

Somewhat unrelated: we are also planning some work on Special:MediaSearch
to better integrate it with our current search infrastructure [2].

[1] https://commons.wikimedia.org/wiki/Special:MediaSearch
[2] https://phabricator.wikimedia.org/T257043

> Thanks,
>   GerardM
>
> On Thu, 9 Jul 2020 at 15:23, Guillaume Lederrey 
> wrote:
>
>> Hello all!
>>
>> The Search Platform team will join the Wikidata office hours on July 21st
>> 16:00 UTC [1]. We are looking forward to discussing Wikidata Query Service
>> and anything else you might find of interest.
>>
>> We've been hard at work on Wikimedia Commons Query Service (WCQS) [2].
>> This will be a SPARQL endpoint similar to WDQS, but serving the Structured
>> Data on Commons dataset. Our goal is to open a beta service, hosted on
>> Wikimedia Cloud Services (WMCS) by the end of July. The service will require
>> an account on Commons for authentication and will allow federation with
>> WDQS. We don't have a streaming update process ready yet, the data will be
>> reloaded from Commons dumps weekly for a start.
>>
>> As part of that work, the dumps for Structured Data on Commons are now
>> available [3]. Note that the prefix used in the TTL dumps is "wd", which
>> does not make much sense. We are working with WMDE on renaming the
>> prefixes, but this is more complex than expected since "wd" is hardcoded in
>> more places than it should be. Those prefixes should only be valid in the
>> local context of the dumps, so renaming them is technically a non breaking
>> change. That being said, if you start using those dumps, make sure you
>> don't rely on this prefix, or that you are ready for a rename [4].
>>
>> We are planning to dig more into the data we have to get a better
>> understanding of the use cases around WDQS [5] (not much content on that
>> task yet, but it is coming). Some very preliminary analysis indicates that
>> less than 2% of the queries on WDQS generate more than 90% of the load.
>> This is definitely something we need to better understand. We will be
>> working on defining the kind of questions we need to answer, and improving
>> our data collection to be able to answer those questions.
>>
>> We have started an internal discussion around "planning for disaster"
>> [6]. We want to better understand the potential failure scenarios around
>> WDQS and have a plan if that worst case does happen. This will include some
>> analytics work and some testing to better understand the constraints and
>> what degraded mode we might still be able to provide in case of
>> catastrophic failure.
>>
>> Thanks for reading!
>>
>>Guillaume
>>
>> [1] https://www.wikidata.org/wiki/Wikidata:Events#Office_hours
>> [2] https://phabricator.wikimedia.org/T251488
>> [3] https://dumps.wikimedia.org/other/wikibase/commonswiki/
>> [4]
>> https://dumps.wikimedia.org/other/wikibase/commonswiki/README_commonsrdfdumps.txt
>> [5] https://phabricator.wikimedia.org/T257045
>> [6] https://phabricator.wikimedia.org/T257055
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET
>>
>


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET


Re: [Wikidata] WDQS status

2020-07-09 Thread Gerard Meijssen
Hoi,
Is this different from Special:MediaSearch ??
Thanks,
  GerardM

On Thu, 9 Jul 2020 at 15:23, Guillaume Lederrey 
wrote:

> Hello all!
>
> The Search Platform team will join the Wikidata office hours on July 21st
> 16:00 UTC [1]. We are looking forward to discussing Wikidata Query Service
> and anything else you might find of interest.
>
> We've been hard at work on Wikimedia Commons Query Service (WCQS) [2].
> This will be a SPARQL endpoint similar to WDQS, but serving the Structured
> Data on Commons dataset. Our goal is to open a beta service, hosted on
> Wikimedia Cloud Services (WMCS) by the end of July. The service will require
> an account on Commons for authentication and will allow federation with
> WDQS. We don't have a streaming update process ready yet, the data will be
> reloaded from Commons dumps weekly for a start.
>
> As part of that work, the dumps for Structured Data on Commons are now
> available [3]. Note that the prefix used in the TTL dumps is "wd", which
> does not make much sense. We are working with WMDE on renaming the
> prefixes, but this is more complex than expected since "wd" is hardcoded in
> more places than it should be. Those prefixes should only be valid in the
> local context of the dumps, so renaming them is technically a non breaking
> change. That being said, if you start using those dumps, make sure you
> don't rely on this prefix, or that you are ready for a rename [4].
>
> We are planning to dig more into the data we have to get a better
> understanding of the use cases around WDQS [5] (not much content on that
> task yet, but it is coming). Some very preliminary analysis indicates that
> less than 2% of the queries on WDQS generate more than 90% of the load.
> This is definitely something we need to better understand. We will be
> working on defining the kind of questions we need to answer, and improving
> our data collection to be able to answer those questions.
>
> We have started an internal discussion around "planning for disaster" [6].
> We want to better understand the potential failure scenarios around WDQS
> and have a plan if that worst case does happen. This will include some
> analytics work and some testing to better understand the constraints and
> what degraded mode we might still be able to provide in case of
> catastrophic failure.
>
> Thanks for reading!
>
>Guillaume
>
> [1] https://www.wikidata.org/wiki/Wikidata:Events#Office_hours
> [2] https://phabricator.wikimedia.org/T251488
> [3] https://dumps.wikimedia.org/other/wikibase/commonswiki/
> [4]
> https://dumps.wikimedia.org/other/wikibase/commonswiki/README_commonsrdfdumps.txt
> [5] https://phabricator.wikimedia.org/T257045
> [6] https://phabricator.wikimedia.org/T257055
>
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
>


Re: [Wikidata] WDQS status

2020-03-04 Thread Thad Guidry
You'll love Flink. I'd encourage using Apache Beam on top of Flink, with its
unified API. That way you can take advantage of Java AND Python and Go
(something that will be important for your teams).

https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html

Thad
https://www.linkedin.com/in/thadguidry/