Re: [Wiki-research-l] Finding the number of links between two wikipedia pages

2017-02-24 Thread Mara Sorella
Thank you, Giovanni, I'll check it out!

Mara

On Fri, Feb 24, 2017 at 10:59 PM, Giovanni Luca Ciampaglia <
gciam...@indiana.edu> wrote:

> Hi Mara,
>
> since you were asking about ontologies, let me point you to our work on
> computational fact checking from knowledge networks, published in PLoS ONE
> <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128193>.
> We developed a measure of semantic similarity based on shortest paths
> between any two concepts of Wikipedia, using the linked data from DBpedia;
> these are the links found in the infoboxes of Wikipedia articles, so they
> are a subset of the hyperlinks of the whole page.
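A rough sketch of the shortest-path idea, just to make it concrete (this is not the measure from the paper, only an illustration of shortest-path-based relatedness on a toy link graph, using networkx):

import networkx as nx

def relatedness(graph, a, b):
    """Return 1 / (1 + shortest-path length), or 0.0 if no path exists."""
    try:
        d = nx.shortest_path_length(graph, a, b)
    except nx.NetworkXNoPath:
        return 0.0
    return 1.0 / (1.0 + d)

# Toy graph standing in for the DBpedia link network.
g = nx.Graph([("Wikipedia", "Wikimedia_Foundation"),
              ("Wikimedia_Foundation", "Jimmy_Wales")])
print(relatedness(g, "Wikipedia", "Jimmy_Wales"))  # 1 / (1 + 2), about 0.33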
>
> In the article we use it as a way to check simple relational statements,
> but it could be put to other uses too. There are also a couple of other
> approaches from the literature, which we cite in the paper, that could be
> relevant for what you are doing.
>
> HTH!
>
> Giovanni
>
>
> Giovanni Luca Ciampaglia <http://glciampaglia.com> *∙* Assistant Research
> Scientist, Indiana University
>
>
> On Sun, Feb 19, 2017 at 2:56 PM, Mara Sorella <sore...@dis.uniroma1.it>
> wrote:
>
>> Hi everybody, I'm new to the list and was referred here by a comment
>> from a Stack Overflow (SO) user on my question [1], which I quote below:
>>
>>
>> I have been successfully able to use the Wikipedia pagelinks SQL dump to
>> obtain hyperlinks between Wikipedia pages for a specific revision time.
>> However, there are cases where multiple instances of such links exist,
>> e.g. between the very same https://en.wikipedia.org/wiki/Wikipedia page
>> and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in
>> finding the number of links between pairs of pages for a specific
>> revision. Ideal solutions would involve dump files other than pagelinks
>> (which I'm not aware of), or using the MediaWiki API.
>>
>>
>>
>> To elaborate, I need this information to weight (almost) every hyperlink
>> between article pages (that is, in NS0) that was present in a specific
>> Wikipedia revision (end of 2015); therefore, I would prefer not to follow
>> the solution suggested by the SO user, which would be rather impractical.
>>
>> Indeed, my final aim is to use this weight in a thresholding fashion to
>> sparsify the Wikipedia graph (which, due to its short diameter, is more
>> or less one giant connected component), in a way that should reflect the
>> "relatedness" of the linked pages (where relatedness is not intended as
>> strictly semantic, but at a higher "concept" level, if I may say so).
>> For this reason, other suggestions on how to determine such weights
>> (possibly using other data sources -- ontologies?) are more than welcome.
>>
>> The graph will be used as a dataset to test an event-tracking algorithm
>> I am doing research on.
>>
>>
>> Thanks,
>>
>> Mara
>>
>>
>>
>>
>> [1] http://stackoverflow.com/questions/4223/number-of-links-between-two-wikipedia-pages/
>>
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Finding the number of links between two wikipedia pages

2017-02-22 Thread Mara Sorella
Hi Giuseppe, Ward

On Tue, Feb 21, 2017 at 5:48 PM, Giuseppe Profiti <profgiuse...@gmail.com>
wrote:

> 2017-02-19 20:56 GMT+01:00 Mara Sorella <sore...@dis.uniroma1.it>:
> > Hi everybody, I'm new to the list and was referred here by a comment
> > from a Stack Overflow (SO) user on my question [1], which I quote below:
> >
> >
> > I have been successfully able to use the Wikipedia pagelinks SQL dump to
> > obtain hyperlinks between Wikipedia pages for a specific revision time.
> >
> > However, there are cases where multiple instances of such links exist,
> > e.g. between the very same https://en.wikipedia.org/wiki/Wikipedia page
> > and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in
> > finding the number of links between pairs of pages for a specific
> > revision.
> >
> > Ideal solutions would involve dump files other than pagelinks (which I'm
> > not aware of), or using the MediaWiki API.
> >
> >
> >
> > To elaborate, I need this information to weight (almost) every hyperlink
> > between article pages (that is, in NS0) that was present in a specific
> > Wikipedia revision (end of 2015); therefore, I would prefer not to follow
> > the solution suggested by the SO user, which would be rather impractical.
>
> Hi Mara,
> The MediaWiki API does not return the multiplicity of the links [1]. As
> far as I can see from the database layout [2], you can't get the
> multiplicity of links from the database either. The only solution that
> occurs to me is to parse the wikitext of the page, as the SO user
> suggested.
>
> In any case, some communities have established writing styles that
> discourage multiple links to the same article (e.g. on the Italian
> Wikipedia a link is attached only to the first occurrence of a word).
> So the numbers you get may vary depending on the style of the community
> and/or the last editor.
>
Yes, this is a good practice that I have noticed is very widespread. Indeed,
it would cause the link-multiplicity-based weighting approach to fail.
A (costly) option would be to inspect the actual article text (possibly
only the abstract). I guess this can be done starting from the dump files.
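Something along these lines, I imagine (a minimal sketch that uses the live
MediaWiki API rather than the dumps; the regex-based counting is a
simplification that ignores links generated by templates, and the timestamp
and titles are just example values):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikitext_as_of(title, timestamp):
    """Fetch the wikitext of the revision of `title` current at `timestamp`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
        "rvslots": "main",
        "rvlimit": 1,
        "rvdir": "older",       # newest revision at or before rvstart
        "rvstart": timestamp,   # e.g. "2015-12-31T23:59:59Z"
        "format": "json",
        "formatversion": 2,
    }
    page = requests.get(API, params=params).json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

def count_links(source, target, timestamp):
    """Count [[...]] wikilinks in `source` whose target (before any | or #) is `target`."""
    text = wikitext_as_of(source, timestamp)
    norm = lambda t: t.strip().replace("_", " ").lower()   # crude title normalisation
    targets = re.findall(r"\[\[([^\]|#]+)", text)
    return sum(1 for t in targets if norm(t) == norm(target))

print(count_links("Wikipedia", "Wikimedia Foundation", "2015-12-31T23:59:59Z"))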

@Ward: could your technology be of help for this task?


> >
> > Indeed, my final aim is to use this weight in a thresholding fashion to
> > sparsify the wikipedia graph (that due to the short diameter is more or
> less
> > a giant connected component), in a way that should reflect the
> "relatedness"
> > of the linked pages (where relatedness is not intended as strictly
> semantic,
> > but at a higher "concept" level, if I may say so).
> > For this reason, other suggestions on how determine such weights
> (possibly
> > using other data sources -- ontologies?) are more than welcome.
>
> When you get the graph of connections, instead of using the
> multiplicity as a weight, you could try to use community detection
> methods to isolate subclusters of strongly connected articles.
> Another approach may be to use centrality measures; however, the only
> one that can be applied to edges instead of just nodes is betweenness
> centrality, if I remember correctly.
>
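(As an aside, a minimal networkx sketch of the edge-betweenness idea, purely
for illustration; as I explain below I ended up not using centrality-based
weights, and the threshold here is arbitrary.)

import networkx as nx

g = nx.karate_club_graph()                  # stand-in for the article link graph
scores = nx.edge_betweenness_centrality(g)  # {(u, v): centrality score}
threshold = 0.02                            # arbitrary cut-off for the example
kept = [edge for edge, s in scores.items() if s >= threshold]
sparse = g.edge_subgraph(kept)
print(sparse.number_of_edges(), "of", g.number_of_edges(), "edges kept")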

Currently, I have resorted to keeping only reciprocal links, but I still get
quite big connected components (despite the fact that I'm actually carrying
out a temporal analysis, where, for each time instant, I consider only pages
exhibiting unusually high traffic).
Concerning community detection techniques/centrality: I discarded them
because I don't want to "impose" connectedness (reachability) at the
subgraph level, but only between single entities (since my algorithm aims
to find some sort of temporally persistent subgraphs with certain
properties).
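Concretely, the reciprocal-link filtering amounts to something like this
(a simplified sketch with networkx; the titles are just toy values):

import networkx as nx

def reciprocal_subgraph(directed_links):
    """Keep a pair of pages only if each one links to the other.

    directed_links: iterable of (source_title, target_title) pairs.
    """
    g = nx.DiGraph()
    g.add_edges_from(directed_links)
    mutual = [(u, v) for u, v in g.edges() if g.has_edge(v, u)]
    return nx.Graph(mutual)   # undirected graph of the reciprocal pairs only

links = [("A", "B"), ("B", "A"), ("A", "C")]     # toy link list
print(list(reciprocal_subgraph(links).edges()))  # [('A', 'B')]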


> In case a fast technical solution comes to mind, I'll write here
> again.
>
> Best,
> Giuseppe
>
> [1] https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Wikipedia&plnamespace=0&pllimit=500&pltitles=Wikimedia_Foundation
> [2] https://upload.wikimedia.org/wikipedia/commons/9/94/MediaWiki_1.28.0_database_schema.svg
>
>
Thank you both for your feedback!

Best,

Mara
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Finding the number of links between two wikipedia pages

2017-02-19 Thread Mara Sorella
Hi everybody, I'm new to the list and was referred here by a comment
from a Stack Overflow (SO) user on my question [1], which I quote below:


I have been successfully able to use the Wikipedia pagelinks SQL dump to
obtain hyperlinks between Wikipedia pages for a specific revision time.
However, there are cases where multiple instances of such links exist,
e.g. between the very same https://en.wikipedia.org/wiki/Wikipedia page
and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in
finding the number of links between pairs of pages for a specific revision.
Ideal solutions would involve dump files other than pagelinks (which I'm
not aware of), or using the MediaWiki API.



To elaborate, I need this information to weight (almost) every hyperlink
between article pages (that is, in NS0) that was present in a specific
Wikipedia revision (end of 2015); therefore, I would prefer not to follow
the solution suggested by the SO user, which would be rather impractical.
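For reference, this is roughly how I extract the links now (a simplified
sketch; it assumes the page and pagelinks dumps are imported into a local
MySQL database, and the connection details are placeholders). Note that
pagelinks keeps at most one row per (source, target) pair, which is exactly
why the multiplicity is lost:

import mysql.connector

# Extract NS0 -> NS0 links from a locally imported dump. Connection
# parameters are placeholders; pagelinks stores each (source, target)
# pair at most once, so link multiplicity cannot be recovered from it.
conn = mysql.connector.connect(user="wiki", password="secret", database="enwiki")
cur = conn.cursor()
cur.execute("""
    SELECT p.page_title, pl.pl_title
    FROM pagelinks pl
    JOIN page p ON p.page_id = pl.pl_from
    WHERE p.page_namespace = 0
      AND pl.pl_namespace = 0
""")
links = cur.fetchall()   # (source_title, target_title) pairs, one row per pair
conn.close()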

Indeed, my final aim is to use this weight in a thresholding fashion to
sparsify the Wikipedia graph (which, due to its short diameter, is more or
less one giant connected component), in a way that should reflect the
"relatedness" of the linked pages (where relatedness is not intended as
strictly semantic, but at a higher "concept" level, if I may say so).
For this reason, other suggestions on how to determine such weights (possibly
using other data sources -- ontologies?) are more than welcome.

The graph will be used as a dataset to test an event-tracking algorithm I am
doing research on.


Thanks,

Mara




[1]
http://stackoverflow.com/questions/4223/number-of-links-between-two-wikipedia-pages/
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l