Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-16 Thread Marc Miquel
eddine Turki -- *From:* Analytics on behalf of Houcemeddine A. Turki *Sent:* Tuesday, 9 July 2019 16:12 *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-15 Thread Dan Andreescu
- *From:* Analytics on behalf of Houcemeddine A. Turki *Sent:* Tuesday, 9 July 2019 16:12 *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. *Subject:* Re: [Analytics] pr

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Marc Miquel
019 16:12 *To:* A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. *Subject:* Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases Dear Mr., I thank you f

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Houcemeddine A. Turki
list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases Dear Mr., I thank you for your efforts. When we were at WikiIndaba 2018, it was interesting

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Houcemeddine A. Turki
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases Marc: >We'd like to start the formal process to have an active collaborat

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Nuria Ruiz
Marc: >We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available. Given that formal collaborations are somewhat hard to obtain (the research team has only so many resources), my recommendation would be to import the public data into other

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Marc Miquel
Thanks for your clarification, Nuria. The categorylinks table has been working better lately. Computing counts on the pagelinks table is critical. I'm afraid there is no solution for this one. I thought about creating a temporary pagelinks table with data from the dumps for each language edition. But

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
>Will there be a release for these two tables? No, sorry, there will not be. The dataset release is about pages and users. To be extra clear though, it is not tables but a denormalized reconstruction of the edit history. > Could I connect to the Hadoop to see if the queries on pagelinks and

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Marc Miquel
Hello Nuria, This seems like an interesting alternative for some data (page, users, revision). It can really help and make some processes faster (at the moment we have given up on running the revision again, as the new user_agent change also made it slower). So we will take a look at it as soon as it is

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
Hello, From your description, it seems that your problem is not one of computation (or at least your main problem is not) but rather one of data extraction. The labs replicas are not meant for big data extraction jobs, as you have just found out. Neither is Hadoop. Now, our team will soon be releasing a dataset of edit