Re: [Wiki-research-l] category extraction question

2017-07-10 Thread Bowen Yu
Hi Leila,

I did something similar before. I was trying to create "top-level" category
labels for articles, like history, society, and technology. I parsed the
wikitext in the dump data to extract all the category labels of each
article. Also, by parsing the pages of namespace 14, I created a
category-relation graph over all the category labels in which, ideally,
every subcategory can reach some "top-level" category. Then, for each
article, you can follow its category labels up the graph to the top-level
categories. More detail can be found in subsection 3.3.2 (Independent
Variables - Identity-based Attachment) of the paper. Hope it helps!
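
In case it is useful, a minimal sketch of that upward walk; the names are
hypothetical, and it assumes a dict mapping each category to its parent
categories (built from the namespace-14 pages) plus an illustrative set of
top-level labels:

from collections import deque

TOP_LEVEL = {'History', 'Society', 'Technology'}  # illustrative set

def top_level_categories(article_cats, parents, max_depth=20):
    """Breadth-first walk up the category graph, collecting every
    top-level category reachable from an article's own categories."""
    found = set()
    seen = set(article_cats)
    queue = deque((cat, 0) for cat in article_cats)
    while queue:
        cat, depth = queue.popleft()
        if cat in TOP_LEVEL:
            found.add(cat)
            continue
        if depth >= max_depth:
            continue  # the real graph has cycles, so cap the walk
        for parent in parents.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found

# e.g. top_level_categories({'Flow_cytometry', 'Bioinformatics'}, parents)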

On Mon, Jul 10, 2017 at 8:45 PM, Stuart A. Yeates  wrote:

> [quoted text snipped; Stuart A. Yeates' reply and Leila Zia's original
> message appear in full below]


Re: [Wiki-research-l] category extraction question

2017-07-10 Thread Stuart A. Yeates
The category system on en.wiki is not an IS-A system, and there have been
several discussions about making it one based on mathematical principles,
which have come to nothing because the consensus of editors is against it.
The best way to think about categories is as a locally faceted
related-links system.

Having said that, Category:Wikipedia maintenance is an important root,
probably useful for separating the wheat from the chaff. Most of these are
also hidden categories. I'm not sure whether this flag appears in the SQL,
but see
https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
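
For what it's worth, the hidden flag lives in the page_props table rather
than in categorylinks: a category marked __HIDDENCAT__ gets a row with
pp_propname = 'hiddencat' for its namespace-14 page. So a filter along
these lines should drop hidden categories; a sketch only, assuming the
page_props dump is loaded alongside page and categorylinks, with
hypothetical connection details:

import pymysql

# Keep only visible (non-hidden) categories of main-namespace articles.
QUERY = """
SELECT p.page_id, p.page_title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p
  ON cl.cl_from = p.page_id
LEFT JOIN page cat
  ON cat.page_namespace = 14 AND cat.page_title = cl.cl_to
LEFT JOIN page_props pp
  ON pp.pp_page = cat.page_id AND pp.pp_propname = 'hiddencat'
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0
  AND pp.pp_page IS NULL  -- keep only categories not marked __HIDDENCAT__
"""

# hypothetical local mirror of the enwiki dump tables
conn = pymysql.connect(host='localhost', user='research', db='enwiki',
                       charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute(QUERY)
    for page_id, title, category in cur:
        print(page_id, title, category)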

cheers
stuart

--
...let us be heard from red core to black sky

On 11 July 2017 at 13:20, Leila Zia  wrote:

> [quoted text snipped; Leila Zia's original message appears in full below]


[Wiki-research-l] category extraction question

2017-07-10 Thread Leila Zia
Hi all,

[If you are not interested in discussions related to the category system
(on English Wikipedia), you can stop here. :)]

We have run into a problem that some of you may have thought about or
addressed before. We are trying to clean up the category system on English
Wikipedia by turning the category structure into an IS-A hierarchy. (The
output of this work can be useful for research on template recommendation
[1], for example, but the use cases won't stop there.) One issue we are
facing is the following:

We are currently using SQL dumps to extract the categories associated with
every article on English Wikipedia (main namespace). [2] Using this
approach, we get 5 categories associated with the Flow cytometry
bioinformatics article [3]:

Flow_cytometry
Bioinformatics

Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list

The problem is that we are only interested in the first two categories. We
have a cleaning step that keeps only categories belonging to the category
Article; that step removes the last category above, but the two
Wikipedia_... categories remain. We need some way to prune those two
categories from the data.

One way to do this would be to parse wikitext instead of the SQL dumps and
extract categories marked by the pattern [[Category:XX]], but in that case
we would lose a good category such as Guided_missiles_of_Norway, because
that one is generated by a template.
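
For reference, pulling the explicit links out of wikitext is
straightforward; a rough sketch, where the regex is approximate and, as
said above, blind to template-generated categories:

import re

# matches [[Category:Foo]] and [[Category:Foo|sortkey]]
CATEGORY_RE = re.compile(r'\[\[\s*Category\s*:\s*([^\]|]+)', re.IGNORECASE)

def categories_from_wikitext(wikitext):
    return [m.strip().replace(' ', '_') for m in CATEGORY_RE.findall(wikitext)]

# categories_from_wikitext('...[[Category:Flow cytometry]]...')
# -> ['Flow_cytometry']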

Any ideas on how we can start with a "cleaner" dataset of categories
related to the topic of the articles, as opposed to maintenance-related or
other types of categories?

Thanks,
Leila

[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages

[2] The exact code we use is:

SELECT p.page_id AS id, p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p
  ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0;

and the edges of the category graph are extracted with:

SELECT p.page_title AS category, cl.cl_to AS parent
FROM categorylinks cl
JOIN page p
  ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14;
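
For pruning or traversal, the edges can then be loaded into an in-memory
graph; a small sketch, where the TSV filename is a hypothetical stand-in
for wherever the query output gets saved:

import csv
from collections import defaultdict

# category -> list of parent categories, from the namespace-14 query above
parents = defaultdict(list)
with open('category_edges.tsv', newline='', encoding='utf-8') as f:
    for category, parent in csv.reader(f, delimiter='\t'):
        parents[category].append(parent)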


[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics


[Wiki-research-l] Fwd: [Wikimedia-l] [fellowship] Opportunity for people working on "open projects that support a healthy Internet."

2017-07-10 Thread Pine W
Forwarding.

Pine


-- Forwarded message --
From: Melody Kramer 
Date: Mon, Jul 10, 2017 at 2:26 PM
Subject: [Wikimedia-l] [fellowship] Opportunity for people working on "open
projects that support a healthy Internet."
To: wikimedi...@lists.wikimedia.org


Hi all,

I wanted to pass along an opportunity that I saw earlier today via Twitter:
https://medium.com/read-write-participate/work-in-the-open-with-mozilla-1410be0a83b2

It sets up people working on "open projects that support a healthy
Internet" with a mentor, a cohort of like-minded people from all over the
world, and a trip to MozFest, a London-based open-Internet conference that
I've attended and presented at in past years and found really
mind-expanding, thanks to the cross-disciplinary conversations that take
place.

You can see previous projects here:
https://mozilla.github.io/leadership-training/round-3/projects/
It looks like there's quite a broad cross-section, and many of the projects
across the movement might be applicable. The post notes that participants
will learn about "best practices for project setup and communication, tools
for collaboration, community building, and running events."

Thank you to Leila for suggesting I pass this along to this listserv. Feel
free to share it broadly.


- Mel


--
Melody Kramer 
Senior Audience Development Manager
Read a random featured article from Wikipedia!


mkra...@wikimedia.org


Re: [Wiki-research-l] Recognizing domain experts contribution to Wikipedia

2017-07-10 Thread Aaron Halfaker
> result in increased quality score (per ORES)

<3  rock on

On Mon, Jul 10, 2017 at 3:15 PM, Shani Evenstein 
wrote:

> [quoted text snipped; Shani Evenstein's and Alex Yarovoy's messages
> appear in full below]


Re: [Wiki-research-l] Recognizing domain experts contribution to Wikipedia

2017-07-10 Thread Shani Evenstein
Hi Alex,

Welcome to the community!

I am based at Tel Aviv University, where I teach two Wiki courses that I
developed, and I research Wikipedia & Wikidata (among other things). If
there's anything I can do to help, I'm a phone call away.  :-)

Best,
Shani.

On 10 Jul 2017 23:02, "Alex Yarovoy"  wrote:

> [quoted text snipped; Alex Yarovoy's message appears in full below]


Re: [Wiki-research-l] Recognizing domain experts contribution to Wikipedia

2017-07-10 Thread Alex Yarovoy
Thank you, Leila, Stuart, and Pine. We will follow up on these comments and
pointers.

A few additional words about this research:
Our narrow definition of formal expertise focuses on people with academic
qualifications who have published a scholarly work (i.e., one that appears
in Google Scholar) on the topic of the specific Wikipedia articles where
they were active.
We acknowledge that many experts do not have academic qualifications.
The choice of "formal" (i.e., academic, in this context) expertise enabled
a concrete operationalization and measurement.
We welcome any ideas for pinpointing informal experts.

We are currently in the first phase of the research, where we try to
identify these formal experts. We've spent a considerable amount of time
identifying 500 such experts, and we now use machine learning techniques to
spot them automatically (preliminary results are quite good).
Once this is done, we can start asking interesting questions, such as:
- What is the share of these formal experts' contributions in the overall
content contributed to Wikipedia?
- Are formal experts' contributions "better"? E.g., do they survive longer,
or do they result in increased quality scores (per ORES; see the sketch
below)?
- Who are these formal experts? Anonymous contributors? Registered users?
Do they take on additional roles within the community?
- What motivates formal experts?
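
For the ORES comparison, something like the following could score the
revisions before and after an expert's edit; a sketch against the public
ORES v3 endpoint, assuming the enwiki "wp10" article-quality model, with
made-up revision IDs:

import requests

ORES_URL = 'https://ores.wikimedia.org/v3/scores/enwiki/'

def quality(revid, model='wp10'):
    """Fetch the predicted article-quality class for one revision."""
    r = requests.get(ORES_URL, params={'models': model, 'revids': revid})
    r.raise_for_status()
    scores = r.json()['enwiki']['scores'][str(revid)]
    return scores[model]['score']['prediction']

# hypothetical parent/child revisions around an expert's edit
print(quality(123456), '->', quality(123457))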

Any other ideas for taking this research forward are more than welcome.

Thank you,
Ofer, Einat and Alex


Re: [Wiki-research-l] EventStreams launch and RCStream deprecation

2017-07-10 Thread Andrew Otto
Alright, we’ve done it!

RCStream is disabled, so any remaining socket.io clients connecting to
stream.wikimedia.org/rc will fail.

Thanks all!
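
For anyone still mid-migration, a minimal EventStreams consumer in Python
looks something like this (a sketch, assuming the third-party sseclient
package; the printed fields come from the RecentChange schema):

import json
from sseclient import SSEClient as EventSource

URL = 'https://stream.wikimedia.org/v2/stream/recentchange'

for event in EventSource(URL):
    if event.event != 'message':
        continue  # ignore comments and heartbeats
    try:
        change = json.loads(event.data)
    except ValueError:
        continue  # skip empty keep-alive payloads
    print('{wiki}: {user} edited {title}'.format(**change))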


On Thu, Jun 22, 2017 at 1:00 PM, Andrew Otto  wrote:

> Hi all,
>
> This is just a friendly reminder that we plan to turn off the RCStream
> service after July 7th.
>
> We’re tracking, as best we can, the progress of porting clients over at
> https://phabricator.wikimedia.org/T156919.  But we can only help with
> what we know about.  If you’ve got something still running on RCStream
> that hasn’t yet been ported, let us know, and/or switch soon!
>
> Thanks!
> -Andrew Otto
>
>
>
> On Wed, Feb 8, 2017 at 9:28 AM, Andrew Otto  wrote:
>
>> Hi everyone!
>>
>> Wikimedia is releasing a new service today: EventStreams.  This service
>> allows us to publish arbitrary streams of JSON event data to the public.
>> Initially, the only stream available will be good ol’ RecentChanges.
>> This event stream overlaps functionality already provided by
>> irc.wikimedia.org and RCStream.  However, this new service has
>> advantages over these (now deprecated) services.
>>
>>
>> 1. We can expose more than just RecentChanges.
>>
>> 2. Events are delivered over streaming HTTP (chunked transfer) instead
>> of IRC or socket.io.  This requires less client-side code and fewer
>> special routing cases on the server side.
>>
>> 3. Streams can be resumed from the past.  By using EventSource, a
>> disconnected client will automatically resume the stream from where it
>> left off, as long as it resumes within one week.  In the future, we
>> would like to allow users to specify historical timestamps from which
>> they would like to begin consuming, if this proves safe and tractable.
>>
>>
>> I did say deprecated!  Okay okay, we may never be able to fully
>> deprecate irc.wikimedia.org.  It’s used by too many (probably sentient
>> by now) bots out there.  We do plan to obsolete RCStream, and to turn it
>> off in a reasonable amount of time.  The deadline is July 7th, 2017.
>> All services that rely on RCStream should migrate to the HTTP-based
>> EventStreams service by this date.  We are committed to assisting you in
>> this transition, so let us know how we can help.
>>
>> Unfortunately, unlike RCStream, EventStreams does not yet have server-
>> side event filtering (e.g. by wiki).  How and if this should be done is
>> still under discussion.
>>
>> The RecentChanges data you are used to remains the same, and is
>> available at https://stream.wikimedia.org/v2/stream/recentchange.
>> However, we may have something different for you, if you find it useful.
>> We have been internally producing new MediaWiki-specific events for a
>> while now, and could expose these via EventStreams as well.
>>
>> Take a look at these events, and tell us what you think.  Would you find
>> them useful?  How would you like to subscribe to them?  Individually as
>> separate streams, or would you like to be able to compose multiple event
>> types into a single stream via an API?  These things are all possible.
>>
>> I asked for a lot of feedback in the above paragraphs.  Let’s try and
>> centralize this discussion over on the mediawiki.org EventStreams talk
>> page.  In summary, the questions are:
>>
>>
>> - What RCStream clients do you maintain, and how can we help you migrate
>> to EventStreams?
>>
>> - Is server-side filtering, by wiki or arbitrary event field, useful to
>> you?
>>
>> - Would you like to consume streams other than RecentChanges?
>> (Currently available events are described here.)
>>
>>
>>
>> Thanks!
>> - Andrew Otto
>>
>>
>>
>