Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-16 Thread Marc Miquel
Thanks for your answer Dan,

I need to retrieve and collect data and do some calculations on them. The
most difficult to extract and work with are revisions, pagelinks and
categorylinks. The approach you are suggesting would be good if I wanted to
mine the text besides the links, but this is not the case. I need to count
all the incoming links to articles, and all the incoming links to articles
from a group of articles. The same goes for outlinks.

I'm going to try to download the pagelinks dump for the SQL tables (e.g.
https://dumps.wikimedia.org/enwiki/20190701). But I am afraid that the time
to download each of them, parse it, transform it to CSV, sort the file by
certain columns, etc. would be more costly (in time) than doing it on the
original tables (replicas), even now that their performance is much slower
than last year. In case it doesn't work, I might ping you again, because I
need to make progress on the project I am working on.
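
Concretely, the preprocessing I have in mind is roughly this sketch (it assumes
the usual multi-row INSERT format of the pagelinks .sql.gz dump, with columns in
the order pl_from, pl_namespace, pl_title, pl_from_namespace; the regex is
deliberately simplified and the paths are placeholders):

    import csv
    import gzip
    import re

    # One value tuple inside "INSERT INTO `pagelinks` VALUES (...),(...);",
    # assuming the column order pl_from, pl_namespace, pl_title, pl_from_namespace.
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")

    def pagelinks_dump_to_csv(dump_path, csv_path):
        """Stream the SQL dump and write pl_from, pl_namespace, pl_title rows as CSV."""
        with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as src, \
             open(csv_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for line in src:
                if not line.startswith("INSERT INTO"):
                    continue
                for pl_from, pl_ns, pl_title, _from_ns in ROW.findall(line):
                    writer.writerow([pl_from, pl_ns, pl_title])

    pagelinks_dump_to_csv("enwiki-20190701-pagelinks.sql.gz", "enwiki-pagelinks.csv")

Sorting by pl_title or pl_from afterwards would then just be a GNU sort on the CSV.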

Thanks for your help.
Best regards,

Marc Miquel

Message from Dan Andreescu  on Mon, 15 Jul 2019 at 21:55:

> Hi Marc,
>
> To follow up on something Nuria said that may have gotten lost: the xml
> dumps have this information, but you'll have to parse the content.  See the
> example in the docs:
> https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Format_of_the_XML_files.
> If you expand the "content dump" link, you'll see one of the revisions has
> some wikitext: "un projet de [[dictionnaire]] écrit collectivement".  You
> can parse the majority of links by looking for
> [[all-text-before-a-vertical-bar-is-the-link-itself|link text]].  And you
> can find all templates similarly by looking for {{template}} markup, categories, etc.
> Of course, if there are nested templates or templates that generate links,
> you won't find those this way.
>
> I think that's what you're investigating right now, but writing just in
> case.  We will eventually publish this data in an easy-to-consume public
> dataset, but we have too many other priorities this year.  Still, if you
> file a task in phabricator, it will help us prioritize it in the future.
>
> And feel free to write back as you're parsing through dumps, any help we
> give you there will help us work on it in the future.  Thanks!
>
> On Tue, Jul 9, 2019 at 11:44 AM Marc Miquel  wrote:
>
>> Thank you Houcemeddine for your answer. At the moment the project is
>> already funded by a Project Grant from the WMF. Nuria had referred to a
>> formal collaboration as a sort of framework for accessing the Hadoop resource.
>>
>> Thanks Nuria for your recommendation on importing data and computing
>> somewhere else. I will do some tests and estimate the time it might take
>> along with the rest of computations, as it needs to run on a monthly basis.
>> This is something I definitely need to verify.
>>
>> Best regards,
>>
>> Marc
>>
>>> Message from Houcemeddine A. Turki  on Tue, 9 Jul 2019 at 17:20:
>>
>>> Dear Mr.,
>>> I thank you for your efforts. The link to H2020 is
>>> https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
>>> Yours Sincerely,
>>> Houcemeddine Turki
>>> --
>>> *From:* Analytics  on behalf of
>>> Houcemeddine A. Turki 
>>> *Sent:* Tuesday, 9 July 2019 16:12
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
>>> accessing analytics hadoop databases
>>>
>>> Dear Mr.,
>>> I thank you for your efforts. When we were at WikiIndaba 2018, it was
>>> interesting to see your research work. The project is interesting
>>> particularly because there are many cultures across the world that are
>>> underrepresented on the Internet, and on Wikipedia in particular. Concerning
>>> the formal collaboration, I think that if your team can apply for an H2020
>>> grant, this will be useful. This worked for the Scholia project and can work
>>> for you as well.
>>> Yours Sincerely,
>>> Houcemeddine Turki
>>> --
>>> *From:* Analytics  on behalf of
>>> Nuria Ruiz 
>>> *Sent:* Tuesday, 9 July 2019 16:00
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
>>> accessing analytics hadoop databases
>>>
>>> Marc:
>>>
>>> >We'd like to start the formal process to have an active collaboration,
>>> as it seems there is no other

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-15 Thread Dan Andreescu
Hi Marc,

To follow up on something Nuria said that may have gotten lost: the xml
dumps have this information, but you'll have to parse the content.  See the
example in the docs:
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Format_of_the_XML_files.
If you expand the "content dump" link, you'll see one of the revisions has
some wikitext: "un projet de [[dictionnaire]] écrit collectivement".  You
can parse the majority of links by looking for
[[all-text-before-a-vertical-bar-is-the-link-itself|link text]].  And you
can find all templates similarly by looking for {{template}} markup, categories, etc.
Of course, if there are nested templates or templates that generate links,
you won't find those this way.
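
As a rough illustration only (it ignores nested templates, comments and
template-generated links, exactly the cases mentioned above), a first pass
could be as simple as:

    import re

    # [[target]], [[target|label]] or [[target#section|label]]; the link target is
    # everything before the first "|" or "#".
    WIKILINK = re.compile(r"\[\[([^\]\|#]+)(?:#[^\]\|]*)?(?:\|[^\]]*)?\]\]")

    def extract_links(wikitext):
        """Return link targets found in a revision's wikitext (first approximation)."""
        return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

    print(extract_links("un projet de [[dictionnaire]] écrit collectivement"))
    # ['dictionnaire']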

I think that's what you're investigating right now, but writing just in
case.  We will eventually publish this data in an easy-to-consume public
dataset, but we have too many other priorities this year.  Still, if you
file a task in phabricator, it will help us prioritize it in the future.

And feel free to write back as you're parsing through dumps, any help we
give you there will help us work on it in the future.  Thanks!

On Tue, Jul 9, 2019 at 11:44 AM Marc Miquel  wrote:

> Thank you Houcemeddine for your answer. At the moment the project is
> already funded by a Project Grant from the WMF. Nuria had referred to a
> formal collaboration as a sort of framework for accessing the Hadoop resource.
>
> Thanks Nuria for your recommendation on importing data and computing
> somewhere else. I will do some tests and estimate the time it might take
> along with the rest of computations, as it needs to run on a monthly basis.
> This is something I definitely need to verify.
>
> Best regards,
>
> Marc
>
> Message from Houcemeddine A. Turki  on Tue, 9 Jul 2019 at 17:20:
>
>> Dear Mr.,
>> I thank you for your efforts. The link to H2020 is
>> https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
>> Yours Sincerely,
>> Houcemeddine Turki
>> --
>> *From:* Analytics  on behalf of
>> Houcemeddine A. Turki 
>> *Sent:* Tuesday, 9 July 2019 16:12
>> *To:* A mailing list for the Analytics Team at WMF and everybody who has
>> an interest in Wikipedia and analytics.
>> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
>> accessing analytics hadoop databases
>>
>> Dear Mr.,
>> I thank you for your efforts. When we were at WikiIndaba 2018, it was
>> interesting to see your research work. The project is interesting
>> particularly because there are many cultures across the world that are
>> underrepresented on the Internet, and on Wikipedia in particular. Concerning
>> the formal collaboration, I think that if your team can apply for an H2020
>> grant, this will be useful. This worked for the Scholia project and can work
>> for you as well.
>> Yours Sincerely,
>> Houcemeddine Turki
>> --
>> *From:* Analytics  on behalf of
>> Nuria Ruiz 
>> *Sent:* Tuesday, 9 July 2019 16:00
>> *To:* A mailing list for the Analytics Team at WMF and everybody who has
>> an interest in Wikipedia and analytics.
>> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
>> accessing analytics hadoop databases
>>
>> Marc:
>>
>> >We'd like to start the formal process to have an active collaboration,
>> as it seems there is no other solution available
>>
>> Given that formal collaborations are somewhat hard to obtain (the research
>> team has only so many resources), my recommendation would be to import the
>> public data into another computing platform that is not as constrained as
>> labs in terms of space, and do your calculations there.
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
>> On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel  wrote:
>>
>> Thanks for your clarification Nuria.
>>
>> The categorylinks table is working better lately. Computing counts on the
>> pagelinks table is the critical part. I'm afraid there is no solution for
>> this one.
>>
>> I thought about creating a temporary pagelinks table with data from the
>> dumps for each language edition. But replicating the pagelinks database on
>> the server's local disk would be very costly in terms of time and space. The
>> enwiki pagelinks table alone must be more than 50GB. The entire process
>> would run for many, many days considering the other language editions too.
>>
>> Another count I need to do is the number of editors per article, which
>> also gets stuck with the revision table. For the rest of the data, as you
>> said, it is more about retrieval, and I

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Marc Miquel
Thank you Houcemeddine for your answer. At the moment the project is
already funded by a Project Grant from the WMF. Nuria had referred to a
formal collaboration as a sort of framework for accessing the Hadoop resource.

Thanks Nuria for your recommendation on importing data and computing
somewhere else. I will do some tests and estimate the time it might take
along with the rest of computations, as it needs to run on a monthly basis.
This is something I definitely need to verify.

Best regards,

Marc

Message from Houcemeddine A. Turki  on Tue, 9 Jul 2019 at 17:20:

> Dear Mr.,
> I thank you for your efforts. The link to H2020 is
> https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
> Yours Sincerely,
> Houcemeddine Turki
> --
> *From:* Analytics  on behalf of
> Houcemeddine A. Turki 
> *Sent:* Tuesday, 9 July 2019 16:12
> *To:* A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
> accessing analytics hadoop databases
>
> Dear Mr.,
> I thank you for your efforts. When we were at WikiIndaba 2018, it was
> interesting to see your research work. The project is interesting
> particularly because there are many cultures across the world that are
> underrepresented on the Internet, and on Wikipedia in particular. Concerning
> the formal collaboration, I think that if your team can apply for an H2020
> grant, this will be useful. This worked for the Scholia project and can work
> for you as well.
> Yours Sincerely,
> Houcemeddine Turki
> --
> *From:* Analytics  on behalf of
> Nuria Ruiz 
> *Sent:* Tuesday, 9 July 2019 16:00
> *To:* A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics.
> *Subject:* Re: [Analytics] project Cultural Diversity Observatory /
> accessing analytics hadoop databases
>
> Marc:
>
> >We'd like to start the formal process to have an active collaboration, as
> it seems there is no other solution available
>
> Given that formal collaborations are somewhat hard to obtain (the research
> team has only so many resources), my recommendation would be to import the
> public data into another computing platform that is not as constrained as
> labs in terms of space, and do your calculations there.
>
> Thanks,
>
> Nuria
>
>
>
> On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel  wrote:
>
> Thanks for your clarification Nuria.
>
> The categorylinks table is working better lately. Computing counts on the
> pagelinks table is the critical part. I'm afraid there is no solution for
> this one.
>
> I thought about creating a temporary pagelinks table with data from the
> dumps for each language edition. But replicating the pagelinks database on
> the server's local disk would be very costly in terms of time and space. The
> enwiki pagelinks table alone must be more than 50GB. The entire process
> would run for many, many days considering the other language editions too.
>
> Another count I need to do is the number of editors per article, which also
> gets stuck with the revision table. For the rest of the data, as you said,
> it is more about retrieval, and I can use alternatives.
>
> The queries to obtain counts from pagelinks are something that worked before
> on the database replicas, and that a database with more power, like Hadoop,
> would handle with relative ease. The problem is a mixture of both retrieval
> and computing power.
>
> We'd like to start the formal process to have an active collaboration, as
> it seems there is no other solution available and we cannot be stuck and
> not deliver the work promised. I'll let you know when I have more info.
>
> Thanks again.
> Best,
>
> Marc Miquel
>
>
> Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 1:44:
>
> >Will there be a release for these two tables?
> No, sorry, there will not be. The dataset release is about pages and
> users. To be extra clear though, it is not tables but a denormalized
> reconstruction of the edit history.
>
> > Could I connect to the Hadoop to see if the queries on pagelinks and
> categorylinks run faster?
> It is a bit more complicated than just "connecting", but I do not think we
> have to dwell on that because, as far as I know, there is no categorylinks
> info in Hadoop.
>
>
> Hadoop has the set of data from mediawiki that we use to create the
> dataset I pointed you to:
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>  and
> a bit more.
>
> Is it possible to extract some of this information from the xml dumps?
> Perhaps som

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Houcemeddine A. Turki
Dear Mr.,
I thank you for your efforts. The link to H2020 is 
https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
Yours Sincerely,
Houcemeddine Turki

From: Analytics  on behalf of
Houcemeddine A. Turki 
Sent: Tuesday, 9 July 2019 16:12
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing
analytics hadoop databases

Dear Mr.,
I thank you for your efforts. When we were at WikiIndaba 2018, it was
interesting to see your research work. The project is interesting particularly
because there are many cultures across the world that are underrepresented on
the Internet, and on Wikipedia in particular. Concerning the formal collaboration,
I think that if your team can apply for an H2020 grant, this will be useful. This
worked for the Scholia project and can work for you as well.
Yours Sincerely,
Houcemeddine Turki

From: Analytics  on behalf of Nuria Ruiz

Sent: Tuesday, 9 July 2019 16:00
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing
analytics hadoop databases

Marc:

>We'd like to start the formal process to have an active collaboration, as it 
>seems there is no other solution available

Given that formal collaborations are somewhat hard to obtain (the research team
has only so many resources), my recommendation would be to import the public
data into another computing platform that is not as constrained as labs in terms
of space, and do your calculations there.

Thanks,

Nuria



On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel 
mailto:marcmiq...@gmail.com>> wrote:
Thanks for your clarification Nuria.

The categorylinks table is working better lately. Computing counts on the
pagelinks table is the critical part. I'm afraid there is no solution for this one.

I thought about creating a temporary pagelinks table with data from the dumps
for each language edition. But replicating the pagelinks database on the server's
local disk would be very costly in terms of time and space. The enwiki pagelinks
table alone must be more than 50GB. The entire process would run for many, many
days considering the other language editions too.

Another count I need to do is the number of editors per article, which also gets
stuck with the revision table. For the rest of the data, as you said, it is more
about retrieval, and I can use alternatives.

The queries to obtain counts from pagelinks are something that worked before on
the database replicas, and that a database with more power, like Hadoop, would
handle with relative ease. The problem is a mixture of both retrieval and
computing power.

We'd like to start the formal process to have an active collaboration, as it 
seems there is no other solution available and we cannot be stuck and not 
deliver the work promised. I'll let you know when I have more info.

Thanks again.
Best,

Marc Miquel


Message from Nuria Ruiz mailto:nu...@wikimedia.org>> on Tue, 9 Jul 2019 at 1:44:
>Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users. To 
be extra clear though, it is not tables but a denormalized reconstruction of 
the edit history.

> Could I connect to the Hadoop to see if the queries on pagelinks and 
> categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we have
to dwell on that because, as far as I know, there is no categorylinks info in
Hadoop.

Hadoop has the set of data from mediawiki that we use to create the dataset I 
pointed you to: 
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history 
and a bit more.

Is it possible to extract some of this information from the xml dumps?  Perhaps 
somebody in the list has other ideas?

Thanks,

Nuria

P.S. So you know in order to facilitate access to our computing resources and 
private data (there is no way for us to give access to only "part" of the data 
we hold in hadoop)  we require an active collaboration with our research team. 
We cannot support ad-hoc access to hadoop for community members.
Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations






On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel 
mailto:marcmiq...@gmail.com>> wrote:
Hello Nuria,

This seems like an interesting alternative for some data (page, users,
revision). It can really help and make some processes faster (at the moment we
gave up re-running the revision query, as the new user_agent change made it
slower as well). So we will take a look at it as soon as it is ready.

However, the scripts are struggling with other tables: pagelinks and category 
graph.

For instance, we need to count the p

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Houcemeddine A. Turki
Dear Mr.,
I thank you for your efforts. When we were at WikiIndaba 2018, it was
interesting to see your research work. The project is interesting particularly
because there are many cultures across the world that are underrepresented on
the Internet, and on Wikipedia in particular. Concerning the formal collaboration,
I think that if your team can apply for an H2020 grant, this will be useful. This
worked for the Scholia project and can work for you as well.
Yours Sincerely,
Houcemeddine Turki

From: Analytics  on behalf of Nuria Ruiz

Sent: Tuesday, 9 July 2019 16:00
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing
analytics hadoop databases

Marc:

>We'd like to start the formal process to have an active collaboration, as it 
>seems there is no other solution available

Given that formal collaborations are somewhat hard to obtain (the research team
has only so many resources), my recommendation would be to import the public
data into another computing platform that is not as constrained as labs in terms
of space, and do your calculations there.

Thanks,

Nuria



On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel 
mailto:marcmiq...@gmail.com>> wrote:
Thanks for your clarification Nuria.

The categorylinks table is working better lately. Computing counts on the
pagelinks table is the critical part. I'm afraid there is no solution for this one.

I thought about creating a temporary pagelinks table with data from the dumps
for each language edition. But replicating the pagelinks database on the server's
local disk would be very costly in terms of time and space. The enwiki pagelinks
table alone must be more than 50GB. The entire process would run for many, many
days considering the other language editions too.

Another count I need to do is the number of editors per article, which also gets
stuck with the revision table. For the rest of the data, as you said, it is more
about retrieval, and I can use alternatives.

The queries to obtain counts from pagelinks are something that worked before on
the database replicas, and that a database with more power, like Hadoop, would
handle with relative ease. The problem is a mixture of both retrieval and
computing power.

We'd like to start the formal process to have an active collaboration, as it 
seems there is no other solution available and we cannot be stuck and not 
deliver the work promised. I'll let you know when I have more info.

Thanks again.
Best,

Marc Miquel


Message from Nuria Ruiz mailto:nu...@wikimedia.org>> on Tue, 9 Jul 2019 at 1:44:
>Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users. To 
be extra clear though, it is not tables but a denormalized reconstruction of 
the edit history.

> Could I connect to the Hadoop to see if the queries on pagelinks and 
> categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we have
to dwell on that because, as far as I know, there is no categorylinks info in
Hadoop.

Hadoop has the set of data from mediawiki that we use to create the dataset I 
pointed you to: 
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history 
and a bit more.

Is it possible to extract some of this information from the xml dumps?  Perhaps 
somebody in the list has other ideas?

Thanks,

Nuria

P.S. So you know in order to facilitate access to our computing resources and 
private data (there is no way for us to give access to only "part" of the data 
we hold in hadoop)  we require an active collaboration with our research team. 
We cannot support ad-hoc access to hadoop for community members.
Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations






On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel 
mailto:marcmiq...@gmail.com>> wrote:
Hello Nuria,

This seems like an interesting alternative for some data (page, users,
revision). It can really help and make some processes faster (at the moment we
gave up re-running the revision query, as the new user_agent change made it
slower as well). So we will take a look at it as soon as it is ready.

However, the scripts are struggling with other tables: pagelinks and category 
graph.

For instance, we need to count the percentage of links an article directs to 
other pages or the percentage of links it receives from a group of pages. 
Likewise, we need to run down the category graph starting from a specific group 
of categories. At the moment, the query that uses pagelinks is not really
working when counting, whether we pass parameters for the entire table or for
specific parts (using batches).

Will there be a release for these two tables? Could I connect to the Hadoop to 
see if the queries on pagelinks

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Nuria Ruiz
Marc:

>We'd like to start the formal process to have an active collaboration, as
it seems there is no other solution available

Given that formal collaborations are somewhat hard to obtain (the research
team has only so many resources), my recommendation would be to import the
public data into another computing platform that is not as constrained as
labs in terms of space, and do your calculations there.

Thanks,

Nuria



On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel  wrote:

> Thanks for your clarification Nuria.
>
> The categorylinks table is working better lately. Computing counts on the
> pagelinks table is the critical part. I'm afraid there is no solution for
> this one.
>
> I thought about creating a temporary pagelinks table with data from the
> dumps for each language edition. But replicating the pagelinks database on
> the server's local disk would be very costly in terms of time and space. The
> enwiki pagelinks table alone must be more than 50GB. The entire process
> would run for many, many days considering the other language editions too.
>
> Another count I need to do is the number of editors per article, which also
> gets stuck with the revision table. For the rest of the data, as you said,
> it is more about retrieval, and I can use alternatives.
>
> The queries to obtain counts from pagelinks are something that worked before
> on the database replicas, and that a database with more power, like Hadoop,
> would handle with relative ease. The problem is a mixture of both retrieval
> and computing power.
>
> We'd like to start the formal process to have an active collaboration, as
> it seems there is no other solution available and we cannot be stuck and
> not deliver the work promised. I'll let you know when I have more info.
>
> Thanks again.
> Best,
>
> Marc Miquel
>
>
> Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 1:44:
>
>> >Will there be a release for these two tables?
>> No, sorry, there will not be. The dataset release is about pages and
>> users. To be extra clear though, it is not tables but a denormalized
>> reconstruction of the edit history.
>>
>> > Could I connect to the Hadoop to see if the queries on pagelinks and
>> categorylinks run faster?
>> It is a bit more complicated than just "connecting", but I do not think
>> we have to dwell on that because, as far as I know, there is no categorylinks
>> info in Hadoop.
>>
>
>> Hadoop has the set of data from mediawiki that we use to create the
>> dataset I pointed you to:
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>>  and
>> a bit more.
>>
>> Is it possible to extract some of this information from the xml dumps?
>> Perhaps somebody in the list has other ideas?
>>
>> Thanks,
>>
>> Nuria
>>
>> P.S. So you know in order to facilitate access to our computing resources
>> and private data (there is no way for us to give access to only "part" of
>> the data we hold in hadoop)  we require an active collaboration with our
>> research team. We cannot support ad-hoc access to hadoop for community
>> members.
>> Here is some info:
>> https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel  wrote:
>>
>>> Hello Nuria,
>>>
>>> This seems like an interesting alternative for some data (page, users,
>>> revision). It can really help and make some processes faster (at the moment
>>> we gave up re-running the revision query, as the new user_agent change made
>>> it slower as well). So we will take a look at it as soon as it is ready.
>>>
>>> However, the scripts are struggling with other tables: pagelinks and
>>> category graph.
>>>
>>> For instance, we need to count the percentage of links an article
>>> directs to other pages or the percentage of links it receives from a group
>>> of pages. Likewise, we need to run down the category graph starting from a
>>> specific group of categories. At the moment, the query that uses pagelinks
>>> is not really working when counting, whether we pass parameters for the
>>> entire table or for specific parts (using batches).
>>>
>>> Will there be a release for these two tables? Could I connect to the
>>> Hadoop to see if the queries on pagelinks and categorylinks run faster?
>>>
>>> If there is any other alternative we'd be happy to try as we cannot
>>> progress for several weeks.
>>> Thanks again,
>>>
>>> Marc
>>>
>>> Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 0:56:
>>>
 Hello,

 From your description it seems that your problem is not one of computation
 (well, your main problem isn't) but rather data extraction. The labs replicas
 are not meant for big data extraction jobs, as you have just found out.
 Neither is Hadoop. Now, our team will soon be releasing a dataset of
 denormalized edit data that you can probably use. It is still up for discussion
 whether the data will be released as a JSON dump or something else, but
 basically it is a denormalized version of 

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Marc Miquel
Thanks for your clarification Nuria.

The categorylinks table is working better lately. Computing counts on the
pagelinks table is the critical part. I'm afraid there is no solution for this one.

I thought about creating a temporary pagelinks table with data from the
dumps for each language edition. But replicating the pagelinks database on
the server's local disk would be very costly in terms of time and space. The
enwiki pagelinks table alone must be more than 50GB. The entire process
would run for many, many days considering the other language editions too.
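
What I had in mind, roughly, is something like this sketch with SQLite as the
local store (the CSV with pl_from, pl_namespace, pl_title extracted from the
dump, and all paths, are placeholders on my side):

    import csv
    import sqlite3

    # Load a pagelinks CSV (pl_from, pl_namespace, pl_title) extracted from the dump
    # into a local SQLite file, with an index so per-article counts stay cheap.
    conn = sqlite3.connect("enwiki_pagelinks.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pagelinks "
        "(pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
    )

    with open("enwiki-pagelinks.csv", newline="", encoding="utf-8") as f:
        conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", csv.reader(f))

    conn.execute("CREATE INDEX IF NOT EXISTS idx_title ON pagelinks (pl_namespace, pl_title)")
    conn.commit()

    # Example: incoming-link count for one article.
    count = conn.execute(
        "SELECT COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title = ?",
        ("Barcelona",),
    ).fetchone()[0]
    print(count)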

Another count I need to do is the number of editors per article, which also
gets stuck with the revision table. For the rest of the data, as you said, it
is more about retrieval, and I can use alternatives.
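
For reference, the editors-per-article count is essentially this (a pymysql
sketch; whether the column is rev_actor or rev_user depends on each wiki's
actor-migration state, so treat the column and host names as assumptions):

    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",  # analytics replica, from Cloud VPS
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",
        charset="utf8mb4",
    )

    def editors_per_article(page_ids):
        """Distinct editors per article for a batch of page ids."""
        placeholders = ", ".join(["%s"] * len(page_ids))
        sql = (
            "SELECT rev_page, COUNT(DISTINCT rev_actor) AS editors "
            "FROM revision WHERE rev_page IN (" + placeholders + ") "
            "GROUP BY rev_page"
        )
        with conn.cursor() as cur:
            cur.execute(sql, list(page_ids))
            return dict(cur.fetchall())

    print(editors_per_article([12, 25, 39]))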

The queries to obtain counts from pagelinks are something that worked before
on the database replicas, and that a database with more power, like Hadoop,
would handle with relative ease. The problem is a mixture of both retrieval
and computing power.

We'd like to start the formal process to have an active collaboration, as
it seems there is no other solution available and we cannot be stuck and
not deliver the work promised. I'll let you know when I have more info.

Thanks again.
Best,

Marc Miquel


Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 1:44:

> >Will there be a release for these two tables?
> No, sorry, there will not be. The dataset release is about pages and
> users. To be extra clear though, it is not tables but a denormalized
> reconstruction of the edit history.
>
> > Could I connect to the Hadoop to see if the queries on pagelinks and
> categorylinks run faster?
> It is a bit more complicated than just "connecting", but I do not think we
> have to dwell on that because, as far as I know, there is no categorylinks
> info in Hadoop.
>

> Hadoop has the set of data from mediawiki that we use to create the
> dataset I pointed you to:
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>  and
> a bit more.
>
> Is it possible to extract some of this information from the xml dumps?
> Perhaps somebody in the list has other ideas?
>
> Thanks,
>
> Nuria
>
> P.S. So you know in order to facilitate access to our computing resources
> and private data (there is no way for us to give access to only "part" of
> the data we hold in hadoop)  we require an active collaboration with our
> research team. We cannot support ad-hoc access to hadoop for community
> members.
> Here is some info:
> https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>
>
>
>
>
>
> On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel  wrote:
>
>> Hello Nuria,
>>
>> This seems like an interesting alternative for some data (page, users,
>> revision). It can really help and make some processes faster (at the moment
>> we gave up re-running the revision query, as the new user_agent change made
>> it slower as well). So we will take a look at it as soon as it is ready.
>>
>> However, the scripts are struggling with other tables: pagelinks and
>> category graph.
>>
>> For instance, we need to count the percentage of links an article directs
>> to other pages or the percentage of links it receives from a group of
>> pages. Likewise, we need to run down the category graph starting from a
>> specific group of categories. At the moment, the query that uses pagelinks
>> is not really working when counting, whether we pass parameters for the
>> entire table or for specific parts (using batches).
>>
>> Will there be a release for these two tables? Could I connect to the
>> Hadoop to see if the queries on pagelinks and categorylinks run faster?
>>
>> If there is any other alternative we'd be happy to try as we cannot
>> progress for several weeks.
>> Thanks again,
>>
>> Marc
>>
>> Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 0:56:
>>
>>> Hello,
>>>
>>> From your description it seems that your problem is not one of computation
>>> (well, your main problem isn't) but rather data extraction. The labs replicas
>>> are not meant for big data extraction jobs, as you have just found out.
>>> Neither is Hadoop. Now, our team will soon be releasing a dataset of
>>> denormalized edit data that you can probably use. It is still up for
>>> discussion whether the data will be released as a JSON dump or something
>>> else, but basically it is a denormalized version of all the data held in the
>>> replicas, regenerated monthly.
>>>
>>> Please take a look at the documentation of the dataset:
>>>
>>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>>>
>>> This is the phab ticket:
>>> https://phabricator.wikimedia.org/T208612
>>>
>>> So, to sum up, once this dataset is out (we hope late this quarter or
>>> early next) you can probably build your own datasets from it thus rendering
>>> your usage of the replicas obsolete. Hopefully this makes sense.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>>
>>>
>>> On Mon, Jul 8, 2019 at 3:34 

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
>Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users.
To be extra clear though, it is not tables but a denormalized
reconstruction of the edit history.

> Could I connect to the Hadoop to see if the queries on pagelinks and
categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we
have to dwell on that because, as far as I know, there is no categorylinks
info in Hadoop.

Hadoop has the set of data from mediawiki that we use to create the dataset
I pointed you to:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
and
a bit more.
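
To give an idea of the shape of that data, a query over the table behind it
looks roughly like this pyspark sketch (field and partition names as documented
on that page; the snapshot value is only an example):

    from pyspark.sql import SparkSession

    # Rough sketch of a per-page distinct-editor count on the denormalized edit
    # history table in Hadoop (wmf.mediawiki_history, as documented on wikitech).
    spark = SparkSession.builder.appName("editors-per-page").getOrCreate()

    editors = spark.sql("""
        SELECT page_id, COUNT(DISTINCT event_user_id) AS editors
        FROM wmf.mediawiki_history
        WHERE snapshot = '2019-06'
          AND wiki_db = 'enwiki'
          AND event_entity = 'revision'
        GROUP BY page_id
    """)
    editors.show(10)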

Is it possible to extract some of this information from the xml dumps?
Perhaps somebody in the list has other ideas?

Thanks,

Nuria

P.S. So you know in order to facilitate access to our computing resources
and private data (there is no way for us to give access to only "part" of
the data we hold in hadoop)  we require an active collaboration with our
research team. We cannot support ad-hoc access to hadoop for community
members.
Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations






On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel  wrote:

> Hello Nuria,
>
> This seems like an interesting alternative for some data (page, users,
> revision). It can really help and make some processes faster (at the moment
> we gave up re-running the revision query, as the new user_agent change made
> it slower as well). So we will take a look at it as soon as it is ready.
>
> However, the scripts are struggling with other tables: pagelinks and
> category graph.
>
> For instance, we need to count the percentage of links an article directs
> to other pages or the percentage of links it receives from a group of
> pages. Likewise, we need to run down the category graph starting from a
> specific group of categories. At the moment, the query that uses pagelinks
> is not really working when counting, whether we pass parameters for the
> entire table or for specific parts (using batches).
>
> Will there be a release for these two tables? Could I connect to the
> Hadoop to see if the queries on pagelinks and categorylinks run faster?
>
> If there is any other alternative we'd be happy to try as we cannot
> progress for several weeks.
> Thanks again,
>
> Marc
>
> Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 0:56:
>
>> Hello,
>>
>> From your description it seems that your problem is not one of computation
>> (well, your main problem isn't) but rather data extraction. The labs replicas
>> are not meant for big data extraction jobs, as you have just found out.
>> Neither is Hadoop. Now, our team will soon be releasing a dataset of
>> denormalized edit data that you can probably use. It is still up for
>> discussion whether the data will be released as a JSON dump or something
>> else, but basically it is a denormalized version of all the data held in the
>> replicas, regenerated monthly.
>>
>> Please take a look at the documentation of the dataset:
>>
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>>
>> This is the phab ticket:
>> https://phabricator.wikimedia.org/T208612
>>
>> So, to sum up, once this dataset is out (we hope late this quarter or
>> early next) you can probably build your own datasets from it thus rendering
>> your usage of the replicas obsolete. Hopefully this makes sense.
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
>>
>> On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel  wrote:
>>
>>> To whom it might concern,
>>>
>>> I am writing in regard to the project *Cultural Diversity Observatory*
>>> and the data we are collecting. In short, this project aims at bridging the
>>> content gaps between language editions that relate to cultural and
>>> geographical aspects. For this we need to retrieve data from all language
>>> editions and Wikidata, and run some scripts that crawl down the category
>>> and link graphs in order to create some datasets and statistics.
>>>
>>> The reason I am writing is that we are stuck: we cannot automate the
>>> scripts that retrieve data from the Replicas. We could create the datasets
>>> a few months ago, but during the past few months it has been impossible.
>>>
>>> We are concerned because it is one thing to create the dataset once for
>>> research purposes and another to create it on a monthly basis. This is
>>> what we promised in the project grant
>>> 
>>> details, and now we cannot do it because of the infrastructure. It is
>>> important to do it on a monthly basis because the data visualizations and
>>> statistics the Wikipedia communities will receive need to be updated.
>>>
>>> Lately there have been some changes in the Replica databases, and queries
>>> that used to take several hours are now getting stuck completely. We
>>> tried to code them in multiple ways: a) using complex 

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Marc Miquel
Hello Nuria,

This seems like an interesting alternative for some data (page, users,
revision). It can really help and make some processes faster (at the moment
we gave up re-running the revision query, as the new user_agent change made
it slower as well). So we will take a look at it as soon as it is ready.

However, the scripts are struggling with other tables: pagelinks and
category graph.

For instance, we need to count the percentage of links an article directs
to other pages or the percentage of links it receives from a group of
pages. Likewise, we need to run down the category graph starting from a
specific group of categories. At the moment, the query that uses pagelinks
is not really working when counting, whether we pass parameters for the
entire table or for specific parts (using batches).
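
For the category graph part, what the script essentially does is a breadth-first
walk like this sketch (pymysql against the replica; the host, depth limit and
batching are simplified, and cl_from / cl_to / cl_type are the standard
categorylinks columns):

    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",
        charset="utf8mb4",
    )

    def crawl_category_graph(seed_categories, max_depth=5):
        """Breadth-first walk down the category graph from a set of category titles."""
        seen = set(seed_categories)
        frontier = list(seed_categories)
        for _ in range(max_depth):
            if not frontier:
                break
            placeholders = ", ".join(["%s"] * len(frontier))
            sql = (
                "SELECT page.page_title FROM categorylinks "
                "JOIN page ON cl_from = page_id "
                "WHERE cl_type = 'subcat' AND cl_to IN (" + placeholders + ")"
            )
            with conn.cursor() as cur:
                cur.execute(sql, frontier)
                children = [
                    row[0].decode("utf-8") if isinstance(row[0], bytes) else row[0]
                    for row in cur.fetchall()
                ]
            frontier = [c for c in children if c not in seen]
            seen.update(frontier)
        return seen

    print(len(crawl_category_graph(["History_of_Catalonia"])))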

Will there be a release for these two tables? Could I connect to the Hadoop
to see if the queries on pagelinks and categorylinks run faster?

If there is any other alternative we'd be happy to try as we cannot
progress for several weeks.
Thanks again,

Marc

Message from Nuria Ruiz  on Tue, 9 Jul 2019 at 0:56:

> Hello,
>
> From your description it seems that your problem is not one of computation
> (well, your main problem isn't) but rather data extraction. The labs replicas
> are not meant for big data extraction jobs, as you have just found out.
> Neither is Hadoop. Now, our team will soon be releasing a dataset of
> denormalized edit data that you can probably use. It is still up for
> discussion whether the data will be released as a JSON dump or something
> else, but basically it is a denormalized version of all the data held in the
> replicas, regenerated monthly.
>
> Please take a look at the documentation of the dataset:
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>
> This is the phab ticket:
> https://phabricator.wikimedia.org/T208612
>
> So, to sum up, once this dataset is out (we hope late this quarter or
> early next) you can probably build your own datasets from it thus rendering
> your usage of the replicas obsolete. Hopefully this makes sense.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel  wrote:
>
>> To whom it might concern,
>>
>> I am writing in regard to the project *Cultural Diversity Observatory*
>> and the data we are collecting. In short, this project aims at bridging the
>> content gaps between language editions that relate to cultural and
>> geographical aspects. For this we need to retrieve data from all language
>> editions and Wikidata, and run some scripts that crawl down the category
>> and link graphs in order to create some datasets and statistics.
>>
>> The reason I am writing is that we are stuck: we cannot automate the
>> scripts that retrieve data from the Replicas. We could create the datasets
>> a few months ago, but during the past few months it has been impossible.
>>
>> We are concerned because it is one thing to create the dataset once for
>> research purposes and another to create it on a monthly basis. This is
>> what we promised in the project grant
>> 
>> details, and now we cannot do it because of the infrastructure. It is
>> important to do it on a monthly basis because the data visualizations and
>> statistics the Wikipedia communities will receive need to be updated.
>>
>> Lately there have been some changes in the Replica databases, and queries
>> that used to take several hours are now getting stuck completely. We tried
>> to code them in multiple ways: a) using complex queries, b) doing the joins
>> as code logic in memory, c) downloading the parts of the table that we
>> require and storing them in a local database. *None is an option now*
>> considering the current performance of the replicas.
>>
>> Bryan Davis suggested that this might be the moment to consult the
>> Analytics team, considering the Hadoop environment is designed to run long,
>> complex queries and has massively more compute power than the Wiki Replicas
>> cluster. We would certainly be relieved if you considered letting us connect
>> to these Analytics databases (Hadoop).
>>
>> Let us know if you need more information on the specific queries or the
>> processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
>> will be happy to explain in detail anything you require.
>>
>> Thanks.
>> Best regards,
>>
>> Marc Miquel
>>
>> PS: You can read about the method we follow to retrieve data and create
>> the dataset here:
>>
>> *Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
>> Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
>> the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM.
>> 2334-0770 *
>> wvvw.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
>> ___
>> Analytics mailing list
>> 

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
Hello,

From your description it seems that your problem is not one of computation
(well, your main problem isn't) but rather data extraction. The labs replicas
are not meant for big data extraction jobs, as you have just found out.
Neither is Hadoop. Now, our team will soon be releasing a dataset of
denormalized edit data that you can probably use. It is still up for discussion
whether the data will be released as a JSON dump or something else, but
basically it is a denormalized version of all the data held in the replicas,
regenerated monthly.

Please take a look at the documentation of the dataset:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history

This is the phab ticket:
https://phabricator.wikimedia.org/T208612

So, to sum up, once this dataset is out (we hope late this quarter or early
next) you can probably build your own datasets from it thus rendering your
usage of the replicas obsolete. Hopefully this makes sense.

Thanks,

Nuria




On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel  wrote:

> To whom it might concern,
>
> I am writing in regard to the project *Cultural Diversity Observatory*
> and the data we are collecting. In short, this project aims at bridging the
> content gaps between language editions that relate to cultural and
> geographical aspects. For this we need to retrieve data from all language
> editions and Wikidata, and run some scripts that crawl down the category
> and link graphs in order to create some datasets and statistics.
>
> The reason I am writing is that we are stuck: we cannot automate the
> scripts that retrieve data from the Replicas. We could create the datasets
> a few months ago, but during the past few months it has been impossible.
>
> We are concerned because it is one thing to create the dataset once for
> research purposes and another to create it on a monthly basis. This is
> what we promised in the project grant
> 
> details, and now we cannot do it because of the infrastructure. It is
> important to do it on a monthly basis because the data visualizations and
> statistics the Wikipedia communities will receive need to be updated.
>
> Lately there have been some changes in the Replica databases, and queries
> that used to take several hours are now getting stuck completely. We tried
> to code them in multiple ways: a) using complex queries, b) doing the joins
> as code logic in memory, c) downloading the parts of the table that we
> require and storing them in a local database. *None is an option now*
> considering the current performance of the replicas.
>
> Bryan Davis suggested that this might be the moment to consult the Analytics
> team, considering the Hadoop environment is designed to run long, complex
> queries and has massively more compute power than the Wiki Replicas
> cluster. We would certainly be relieved if you considered letting us connect
> to these Analytics databases (Hadoop).
>
> Let us know if you need more information on the specific queries or the
> processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
> will be happy to explain in detail anything you require.
>
> Thanks.
> Best regards,
>
> Marc Miquel
>
> PS: You can read about the method we follow to retrieve data and create
> the dataset here:
>
> *Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
> Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
> the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM.
> 2334-0770 *
> wvvw.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics