Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-20 Thread O. O.
  Thanks Roan.

Roan Kattouw wrote:
 Aryeh Gregor schreef:
 http://en.wikipedia.org/w/api.php?action=querymeta=siteinfosiprop=namespaces
 
 Note that namespaces with an ID of 100 or higher are specific to enwiki 
 and may have different names or not be used at all on other wikis. To 
 get an accurate list for another wiki, ask that wiki's api.php .

Your above link helped me a lot. I am only concerned about the English 
Version of the Wikipedia – so I think that is all I need. I was afraid 
that there might have been other values of “namespace” but the result of

SELECT page_namespace, page_title FROM wikidb.page WHERE (page_namespace 
 -2 OR page_namespace 15) AND page_namespace != 100 AND page_namespace 
!= 101;

confirms that there are no custom namespaces other than those listed in 
your Link. So I am good here.




 As for redirects: yes, you'll want to do something like:
 
 SELECT page_namespace, page_title, rd_namespace, rd_title
 FROM page LEFT JOIN redirect ON rd_from=page_id;
 

Actually you simply need a JOIN here i.e. Inner Join (that would 
eliminate the NULLs) something like

SELECT page_namespace, page_title, rd_namespace, rd_title FROM 
wikidb.page JOIN wikidb.redirect ON rd_from=page_id;

(NOTE: I did not rebuild the Redirect Table, but I simply imported it as 
it is provided in SQL format with the other Wiki Dumps.)

My problem is the following:

I cannot understand why there are more redirects in the redirect table 
than there are in the pages in the Page table with the redirect flag set.

select count(*) from wikidb.redirect ;   ---  Gives 3422938

while

select count(*) from wikidb.page where page_is_redirect=1;   ---  Gives 
3242340

Why should there be this difference? Are there redirects that are not 
listed in the Page Table?

Also
SELECT count(*)  FROM wikidb.page JOIN wikidb.redirect ON 
rd_from=page_id;  ---  Gives  3210708

Any ideas why the difference?
Thanks again,
O. O.




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-20 Thread Aryeh Gregor
On Fri, Mar 20, 2009 at 1:08 PM, O. O. olson...@yahoo.com wrote:
 I cannot understand why there are more redirects in the redirect table
 than there are in the pages in the Page table with the redirect flag set.

 select count(*) from wikidb.redirect ;   ---  Gives 3422938

 while

 select count(*) from wikidb.page where page_is_redirect=1;   ---  Gives
 3242340

 Why should there be this difference? Are there redirects that are not
 listed in the Page Table?

It's denormalized data, so I'd assume that those are just errors of
some kind.  If you need to be really really sure whether something is
a redirect, you have to parse its text (it's just a regex, should be
somewhere in includes/Article.php or something).  That's the
authoritative check, which IIRC is used when actually deciding whether
to redirect.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-20 Thread O. O.
Thanks Aryeh.

Aryeh Gregor wrote:
 On Fri, Mar 20, 2009 at 1:08 PM, O. O. olson...@yahoo.com wrote:
 I cannot understand why there are more redirects in the redirect table
 than there are in the pages in the Page table with the redirect flag set.

 select count(*) from wikidb.redirect ;   ---  Gives 3422938

 while

 select count(*) from wikidb.page where page_is_redirect=1;   ---  Gives
 3242340

 Why should there be this difference? Are there redirects that are not
 listed in the Page Table?
 
 It's denormalized data, so I'd assume that those are just errors of
 some kind.  If you need to be really really sure whether something is
 a redirect, you have to parse its text (it's just a regex, should be
 somewhere in includes/Article.php or something).  That's the
 authoritative check, which IIRC is used when actually deciding whether
 to redirect.
 

As,
SELECT count(*)  FROM wikidb.page JOIN wikidb.redirect ON 
rd_from=page_id;  ---  Gives  3210708

Which is less than the total rows in the Redirect Table i.e.

select count(*) from wikidb.redirect ;   ---  Gives 3422938 which is 
more than 3210708

I think this might mean that there are entries in the Redirect Table 
where the rd_from does not correspond to a valid page_id in the Page 
Table? I don’t know what these would be used for though.

Thanks again,
O. O.



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-20 Thread Aryeh Gregor
On Fri, Mar 20, 2009 at 1:26 PM, O. O. olson...@yahoo.com wrote:
 I think this might mean that there are entries in the Redirect Table
 where the rd_from does not correspond to a valid page_id in the Page
 Table? I don’t know what these would be used for though.

They're probably just wrong.  The page may have been deleted and the
redirect entry not, for some reason.  This sort of thing happens with
denormalized data.  *In theory* everything is in transactions, so it
shouldn't happen, except when they aren't and so it does, or they are
and it still does somehow.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-20 Thread O. O.
Aryeh Gregor wrote:
 On Fri, Mar 20, 2009 at 1:26 PM, O. O. olson...@yahoo.com wrote:
 I think this might mean that there are entries in the Redirect Table
 where the rd_from does not correspond to a valid page_id in the Page
 Table? I don’t know what these would be used for though.
 
 They're probably just wrong.  The page may have been deleted and the
 redirect entry not, for some reason.  This sort of thing happens with
 denormalized data.  *In theory* everything is in transactions, so it
 shouldn't happen, except when they aren't and so it does, or they are
 and it still does somehow.

Thanks Aryeh for the explanations. Now it makes sense i.e. I should take 
these numbers with a pinch of salt, because there might be errors in the 
Wikipedia Database itself :) .

O. O.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-19 Thread Roan Kattouw
Platonides schreef:
 (it's helpfully provided in the API result . . . actually, what
 does it mean that Portal and Portal talk are canonical? shouldn't
 there be no canonical attribute if the namespace is custom?).
 
 Agree. Portal and Portal talk could still be acceptable, since the
 namespace ids 100-101 are more or less reserved for portals across the
 wikis.
 What is scaryier is seeing ns id=102 canonical=Cookbook on
 enwikibooks whereas the same ns 102 mean Wikiproject on some pedias.
 
 Since the API provides namespacealiases linked to the id, not to the
 informal canonical name I see no reason to keep the canonical
 parameter on the extra ns.
 
This was brought up before in bug 16672 comment #5. My reply was:

  b) custom namespaces shouldn't have a canonical name
Maybe, maybe not; I see arguments for and against. But since 
$wgCanonicalNames contains canonical names for custom namespaces too and 
since removing the canonical attribute for some namespaces but not 
others would violate expectations and be a breaking change, I'll just 
keep stuff the way it is. Regardless of whether custom namespaces should 
or shouldn't have a canonical name, removing it from the API output 
isn't worth the trouble.

Roan Kattouw (Catrope)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-18 Thread Aryeh Gregor
On Wed, Mar 18, 2009 at 6:18 AM, Petr Kadlec petr.kad...@gmail.com wrote:
 page_title does not contains the full title, only its
 namespace-relative part. You need to use
 select page_namespace, page_title from wikidb.page
 Only this whole tuple (page_namespace, page_title) is a unique
 identifier of a page (this is true for the whole MediaWiki).

And note that the namespace is stored as a number.  You'll need to
refer to a list of the namespace numbers on the specific wiki you're
dealing with to translate it into the appropriate prefix.  There's a
way to get the list from the API, but I don't know it offhand.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-18 Thread Roan Kattouw
Aryeh Gregor schreef:
 On Wed, Mar 18, 2009 at 6:18 AM, Petr Kadlec petr.kad...@gmail.com wrote:
 page_title does not contains the full title, only its
 namespace-relative part. You need to use
 select page_namespace, page_title from wikidb.page
 Only this whole tuple (page_namespace, page_title) is a unique
 identifier of a page (this is true for the whole MediaWiki).
 
 And note that the namespace is stored as a number.  You'll need to
 refer to a list of the namespace numbers on the specific wiki you're
 dealing with to translate it into the appropriate prefix.  There's a
 way to get the list from the API, but I don't know it offhand.
 
http://en.wikipedia.org/w/api.php?action=querymeta=siteinfosiprop=namespaces

Note that namespaces with an ID of 100 or higher are specific to enwiki 
and may have different names or not be used at all on other wikis. To 
get an accurate list for another wiki, ask that wiki's api.php .

As for redirects: yes, you'll want to do something like:

SELECT page_namespace, page_title, rd_namespace, rd_title
FROM page LEFT JOIN redirect ON rd_from=page_id;

This'll list all page titles and their redirect targets, with 
rd_namespace and rd_title set to NULL for pages that aren't redirects. 
Note that the redirect table doesn't handle section redirects (like 
redirects to [[Foo#Bar]], which are stored as redirects to [[Foo]]) and 
interwiki redirects (like redirects to [[wikt:dog]], which are stored as 
redirects to [[dog]]) too well and that some redirects may be missing 
from it entirely (IIRC about half a million redirects are missing from 
enwiki's redirect table). Even worse, the data dump you downloaded might 
not even contain the redirect table. You can rebuild the redirect table 
with:

php maintenance/refreshLinks.php --redirects-only

(Use --old-redirects-only to only add missing entries rather than 
checking existing entries for validity as well.)

Roan Kattouw (Catrope)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-18 Thread Petr Kadlec
2009/3/18 O. O. olson...@yahoo.com:
 This is fine, but where can I find information on custom namespaces i.e.
 those that lie above 100.

In $wgExtraNamespaces (see
http://www.mediawiki.org/wiki/Manual:Using_custom_namespaces)

-- [[cs:User:Mormegil | Petr Kadlec]]

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-18 Thread Aryeh Gregor
On Wed, Mar 18, 2009 at 11:06 AM, Roan Kattouw roan.katt...@home.nl wrote:
 http://en.wikipedia.org/w/api.php?action=querymeta=siteinfosiprop=namespaces

 Note that namespaces with an ID of 100 or higher are specific to enwiki
 and may have different names or not be used at all on other wikis. To
 get an accurate list for another wiki, ask that wiki's api.php .

The same is pretty much true for all namespaces.  There's no guarantee
that any namespaces other than main will have the same names on other
wikis.  To ensure that, you need to use the canonical name if one
exists (it's helpfully provided in the API result . . . actually, what
does it mean that Portal and Portal talk are canonical? shouldn't
there be no canonical attribute if the namespace is custom?).

In particular, Wikipedia and Wikipedia talk will likely not work
on most other wikis.

On Wed, Mar 18, 2009 at 11:54 AM, O. O. olson...@yahoo.com wrote:
 Thanks Petr and Aryeh for getting back. From the Documentation at
 http://www.mediawiki.org/wiki/Page_table and
 http://meta.wikimedia.org/wiki/Help:Namespace you can get the names of
 the Real and Virtual Namespeaces in includes/Defines.php and then get
 what text they convert to in English using
 languages/messages/MessagesEn.php.

 This is fine, but where can I find information on custom namespaces i.e.
 those that lie above 100.

Use Roan's link:

http://en.wikibooks.org/w/api.php?action=querymeta=siteinfosiprop=namespaces

You might prefer this to Defines.php/MessagesEn.php.  Those will give
you the canonical names, which will always work, but which might not
be the ones used on Wikipedia.  For instance, namespace 4 is
canonically Project, but on Wikipedia the normal name for it is
Wikipedia.  Project URLs will work on Wikipedia, but automatically
redirect to Wikipedia.  E.g.,

http://en.wikipedia.org/wiki/Project:WikiProject_Dorset

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Getting the list of Page Titles and Redirects of Wikipedia

2009-03-18 Thread Platonides
Aryeh Gregor wrote:
 On Wed, Mar 18, 2009 at 11:06 AM, Roan Kattouw roan.katt...@home.nl wrote:
 http://en.wikipedia.org/w/api.php?action=querymeta=siteinfosiprop=namespaces

 Note that namespaces with an ID of 100 or higher are specific to enwiki
 and may have different names or not be used at all on other wikis. To
 get an accurate list for another wiki, ask that wiki's api.php .
 
 The same is pretty much true for all namespaces.  There's no guarantee
 that any namespaces other than main will have the same names on other
 wikis.  To ensure that, you need to use the canonical name if one
 exists (it's helpfully provided in the API result . . . actually, what
 does it mean that Portal and Portal talk are canonical? shouldn't
 there be no canonical attribute if the namespace is custom?).

Agree. Portal and Portal talk could still be acceptable, since the
namespace ids 100-101 are more or less reserved for portals across the
wikis.
What is scaryier is seeing ns id=102 canonical=Cookbook on
enwikibooks whereas the same ns 102 mean Wikiproject on some pedias.

Since the API provides namespacealiases linked to the id, not to the
informal canonical name I see no reason to keep the canonical
parameter on the extra ns.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l