[MediaWiki-commits] [Gerrit] [FIX] Allow dynamic interwiki links - change (pywikibot/core)

2014-10-13 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged.

Change subject: [FIX] Allow dynamic interwiki links
..


[FIX] Allow dynamic interwiki links

This uses the 'interwikimap' siteinfo to determine the Site to which an
interwiki prefix redirects. A Family supports 'from_url', which returns
the 'code' (an entry in 'langs') to which the URL belongs. If no code of
that family supports the URL, 'None' is returned.
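
As a rough illustration of the idea (not the code from this change), a
family could map a URL back to a code by comparing the domain against its
'langs' entries. The class name MiniFamily, its fixed '/wiki/' path and the
sample domains are assumptions made for this sketch only:

```python
import re


class MiniFamily:
    """Hypothetical stand-in for a pywikibot Family (illustration only)."""

    def __init__(self, name, langs, article_path='/wiki/'):
        self.name = name
        self.langs = langs  # maps code -> domain
        # Assumption: every code of the family shares one article path.
        self.article_path = article_path

    def from_url(self, url):
        """Return the code whose domain matches url, or None."""
        # Interwiki URLs end in the literal placeholder '$1'.
        pattern = (r'(?:https?://|//)?([^/]+)' +
                   re.escape(self.article_path) + r'\$1$')
        match = re.match(pattern, url)
        if not match:
            return None
        for code, domain in self.langs.items():
            if domain == match.group(1):
                return code
        return None


wikipedia = MiniFamily('wikipedia', {'en': 'en.wikipedia.org',
                                     'test': 'test.wikipedia.org'})
print(wikipedia.from_url('https://test.wikipedia.org/wiki/$1'))
print(wikipedia.from_url('https://example.org/wiki/$1'))
```

The first call prints the matching code and the second prints None, because
example.org is not a domain listed in 'langs'.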

Because both the test and wikipedia families support the URL
test.wikipedia.org, the test family always returns None so that the
wikipedia family is chosen. This is consistent with the test wikidata
instance, which is in the wikidata family and not in a separate test family.

To get the family for a specific URL, it might be necessary to iterate
through all families, which are loaded in the process. Each site has a local
cache which holds the Site represented by a specific interwiki prefix. There
is also a global cache which stores the family name and code for each URL so
that later searches for the URL do not need to iterate through all families.

This also adds a 'url' parameter to the pywikibot.Site function, which can
only be set when neither code nor fam is set.

The code used in 'nice_get_address' should be returned when the URL is parsed
by 'from_url'. This also fixes all link tests so that the 'show_failures'
feature could be removed.

It also changes the exception from Error to InvalidTitle when there is
no title given and it has a namespace or no interwiki link. It also only
throws an InvalidTitle if an interwiki link links to a non-local page
via a local interwiki link. Previously, no interwiki link that referred
to its own site was allowed.

Change-Id: Iec4f32fdefde15b2330dbb191fb3bbb0d04803e2
---
M pywikibot/__init__.py
M pywikibot/families/test_family.py
M pywikibot/family.py
M pywikibot/page.py
M pywikibot/site.py
M tests/interwiki_link_tests.py
M tests/link_tests.py
7 files changed, 354 insertions(+), 288 deletions(-)

Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/pywikibot/__init__.py b/pywikibot/__init__.py
index 5257f65..2e569c0 100644
--- a/pywikibot/__init__.py
+++ b/pywikibot/__init__.py
@@ -459,9 +459,10 @@
 
 
 _sites = {}
+_url_cache = {}  # The code/fam pair for each URL
 
 
-def Site(code=None, fam=None, user=None, sysop=None, interface=None):
+def Site(code=None, fam=None, user=None, sysop=None, interface=None, url=None):
     """A factory method to obtain a Site object.
 
     Site objects are cached and reused by this method.
@@ -479,12 +480,40 @@
     @type sysop: unicode
     @param interface: site interface (override config.site_interface)
     @type interface: string
+    @param url: Instead of code and fam, does try to get a Site based on the
+        URL. Still requires that the family supporting that URL exists.
+    @type url: string
     """
+    # Either code and fam or only url
+    assert(not url or (not code and not fam))
     _logger = "wiki"
 
-    # Fallback to config defaults
-    code = code or config.mylang
-    fam = fam or config.family
+    if url:
+        if url in _url_cache:
+            cached = _url_cache[url]
+            if cached:
+                code = cached[0]
+                fam = cached[1]
+            else:
+                raise Error("Unknown URL '{0}'.".format(url))
+        else:
+            # Iterate through all families and look, which does apply to
+            # the given URL
+            for fam in config.family_files:
+                family = pywikibot.family.Family.load(fam)
+                code = family.from_url(url)
+                if code:
+                    _url_cache[url] = (code, fam)
+                    break
+            else:
+                _url_cache[url] = None
+                # TODO: As soon as AutoFamily is ready, try and use an
+                #   AutoFamily
+                raise Error("Unknown URL '{0}'.".format(url))
+    else:
+        # Fallback to config defaults
+        code = code or config.mylang
+        fam = fam or config.family
     interface = interface or config.site_interface
 
     # config.usernames is initialised with a dict for each family name
diff --git a/pywikibot/families/test_family.py b/pywikibot/families/test_family.py
index 27d5f4b..c2cc459 100644
--- a/pywikibot/families/test_family.py
+++ b/pywikibot/families/test_family.py
@@ -13,3 +13,6 @@
         self.langs = {
             'test': 'test.wikipedia.org',
         }
+
+    def from_url(self, url):
+        return None  # Don't accept this, but 'test' of 'wikipedia'
diff --git a/pywikibot/family.py b/pywikibot/family.py
index 1bfd558..6765176 100644
--- a/pywikibot/family.py
+++ b/pywikibot/family.py
@@ -1035,6 +1035,64 @@
     def nice_get_address(self, code, title):
         return '%s%s' % (self.nicepath(code), title)
 
+    def _get_path_regex(self):
+        """
+        Return a regex 

[MediaWiki-commits] [Gerrit] [FIX] Allow dynamic interwiki links - change (pywikibot/core)

2014-09-20 Thread XZise (Code Review)
XZise has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/161671

Change subject: [FIX] Allow dynamic interwiki links
..

[FIX] Allow dynamic interwiki links

This uses the 'interwikimap' siteinfo to determine the Site to which an
interwiki prefix redirects. Each Family now has a method 'from_url',
which returns the 'code' (an entry in 'langs') to which the URL belongs.
If no code supports the URL, 'None' is returned.

The family 'test' always returns None, because there are two families
which support the URL: 'wikipedia' and 'test' (both code 'test'). As
test.wikidata.org is in the wikidata family, returning None prefers the
wikipedia family (so both are consistent).

This needs to iterate through all families, which effectively loads all
of them eagerly instead of lazily. Also, 'from_url' should always return
the code that was used when a link was generated with 'nice_get_address'.

Change-Id: Iec4f32fdefde15b2330dbb191fb3bbb0d04803e2
---
M pywikibot/families/lockwiki_family.py
M pywikibot/families/test_family.py
M pywikibot/family.py
M pywikibot/page.py
M pywikibot/site.py
5 files changed, 132 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/pywikibot/core refs/changes/71/161671/1

diff --git a/pywikibot/families/lockwiki_family.py b/pywikibot/families/lockwiki_family.py
index 0a4d0a4..bf4f578 100644
--- a/pywikibot/families/lockwiki_family.py
+++ b/pywikibot/families/lockwiki_family.py
@@ -29,7 +29,7 @@
 
     def nicepath(self, code):
         """Return the nice article path for this family."""
-        return "%s/" % self.path(self, code)
+        return "%s/" % self.path(code)
 
     def protocol(self, code):
         """Return https as the protocol for this family."""
diff --git a/pywikibot/families/test_family.py b/pywikibot/families/test_family.py
index 27d5f4b..c2cc459 100644
--- a/pywikibot/families/test_family.py
+++ b/pywikibot/families/test_family.py
@@ -13,3 +13,6 @@
         self.langs = {
             'test': 'test.wikipedia.org',
         }
+
+    def from_url(self, url):
+        return None  # Don't accept this, but 'test' of 'wikipedia'
diff --git a/pywikibot/family.py b/pywikibot/family.py
index a38f471..98416cf 100644
--- a/pywikibot/family.py
+++ b/pywikibot/family.py
@@ -21,6 +21,7 @@
 import pywikibot
 from pywikibot import config2 as config
 from pywikibot.tools import deprecated
+from pywikibot.exceptions import Error
 
 logger = logging.getLogger("pywiki.wiki.family")
 
@@ -986,6 +987,64 @@
     def nice_get_address(self, code, title):
         return '%s%s' % (self.nicepath(code), title)
 
+    def _get_path_regex(self):
+        """
+        Return a regex matching the path after the domain.
+
+        It is using L{Family.path} and L{Family.nicepath} with code set to
+        'None'. If that returns a KeyError (L{Family.scriptpath} probably
+        using the C{langs} dictionary) it retries it with the key from
+        L{Family.langs} if it only contains one entry and throws an Error
+        otherwise. In that case the Family instance should overwrite this
+        method or supply code independent methods.
+
+        @raise Error: If it's not possible to automatically get a code
+            independent regex.
+        """
+        def _get_coded_path_regex(code):
+            return ('(?:' + re.escape(self.path(code) + '/') + '|' +
+                    re.escape(self.nicepath(code)) + ')')
+        try:
+            return _get_coded_path_regex(None)
+        except KeyError:
+            # Probably automatically generated family
+            if len(self.langs) == 1:
+                return _get_coded_path_regex(next(iter(self.langs.keys())))
+            else:
+                raise Error('Pywikibot is unable to generate an automatic '
+                            'path regex for the family {0}. It is recommended '
+                            'to overwrite _get_path_regex in that '
+                            'family.'.format(self.name))
+
+    def from_url(self, url):
+        """
+        Return whether this family matches the given url.
+
+        The protocol must match, if it is present in the URL. It must match
+        URLs generated via C{self.langs} and L{Family.nice_get_address} or
+        L{Family.path}.
+
+        It uses L{Family._get_path_regex} to generate a regex defining the path
+        after the domain.
+
+        @return: The language code of the url. None if that url is not from
+            this family.
+        @rtype: str or None
+        """
+        url_match = re.match(r'(?:(https?)://|//)?(.*){0}'
+                             '\$1'.format(self._get_path_regex()), url)
+        if not url_match:
+            return None
+        for code, domain in self.langs.items():
+            if domain == url_match.group(2):
+                break
+        else:
+            return None
+        if url_match.group(1) and url_match.group(1) != self.protocol(code):
+            return None
+