Ehm. So "ThisIsAPage" => "This Is APage", "XRay" => "XRay" (and not
"X Ray") and "PresidentUThant" => "President UThant"?
I fail to see where this becomes easier than the current method, which
is fairly straightforward:
1) Clean the page first by collapsing all spaces so that only one
space remains per whitespace sequence. Find a page by that name.
2) IFF the page in 1) did not exist, collapse the rest of the spaces
away, then try finding that page. This keeps it compatible with the
CamelCase syntax and the pre-2.6 method. We can make this an option
in 3.0, and for new installs, turn it off so they will not create a
legacy.
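A minimal sketch of the two-step lookup described in 1) and 2), for illustration only. The class name, the `find` signature, the `existingPages` set and the `legacyFallback` flag are all hypothetical stand-ins; this is not JSPWiki's actual API, and the real lookup goes through the page provider.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, NOT JSPWiki's actual API: it only illustrates the two steps.
final class TwoStepPageLookup {

    // Step 1: trim the ends and collapse each whitespace run to a single space.
    static String collapseSpaces(String name) {
        return name.trim().replaceAll("\\s+", " ");
    }

    // Step 2 (fallback): remove the remaining spaces entirely.
    static String stripSpaces(String name) {
        return name.replaceAll("\\s+", "");
    }

    // existingPages stands in for whatever the page provider knows about.
    // legacyFallback models the proposed option that new installs would turn off.
    static Optional<String> find(String requested, Set<String> existingPages,
                                 boolean legacyFallback) {
        String cleaned = collapseSpaces(requested);
        if (existingPages.contains(cleaned)) {
            return Optional.of(cleaned);
        }
        if (legacyFallback) {
            String squashed = stripSpaces(cleaned);
            if (existingPages.contains(squashed)) {
                return Optional.of(squashed);
            }
        }
        return Optional.empty();
    }
}
```

With the fallback on, a request for "Main   Page" still resolves to a legacy page named "MainPage"; with it off, only "Main Page" would match.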
The beautifyTitle() kind of attempts to do exactly this, but it's not
particularly efficient and is known to break in several cases.
I also think changing the page names upon import is a bad idea, since
it will, e.g., break certain types of search and possibly scripts which
might rely on pages named in a certain pattern. I'd say let's keep
the page names as-is, since it's likely that the users have chosen the
pagenames with a particular purpose in mind, and I'd rather not go
around second-guessing them.
/Janne
On Nov 5, 2009, at 18:21 , Andrew Jaquith wrote:
Yes, that is right. While the examples do look a bit odd, I'd point
out that MYPAGE in all-caps is an acronym.
There is one refinement we could make to rule (3): add a space when
an uppercase letter follows a lowercase letter, so PageJSPWiki would
expand to Page JSPWiki.
That is still fairly simple, while producing good results.
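As a rough sketch of that refinement (a space wherever an uppercase letter directly follows a lowercase one), assuming a plain regex is acceptable; the class and method names are made up for illustration:

```java
// Hypothetical illustration of the refined rule (3); names are not from JSPWiki.
final class CamelCaseRefinement {

    // Insert one space at every position that has a lowercase letter on its
    // left and an uppercase letter on its right. Runs of capitals stay intact.
    static String expand(String name) {
        return name.replaceAll("(?<=\\p{Ll})(?=\\p{Lu})", " ");
    }
}
```

Under this sketch, PageJSPWiki becomes "Page JSPWiki", while all-caps names such as MYPAGE are left alone because no lowercase letter precedes any of the capitals.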
On Nov 5, 2009, at 0:49, Harry Metske <[email protected]> wrote:
so that would mean for example :
[MYPAGE] => [ M Y P A G E ]
[IPPhone] => [ I P Phone]
[mypagE] => [mypag E]
looks a bit odd to me
2009/11/5 Andrew Jaquith <[email protected]>
I'd define it as "an uppercase letter that follows a non-whitespace
character."
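Taken literally, that definition could be sketched as follows. This is only an illustration with made-up names, and it shows why acronyms come out spelled letter by letter:

```java
// Hypothetical illustration of "an uppercase letter that follows a
// non-whitespace character"; names are not from JSPWiki.
final class NaiveCamelCaseExpansion {

    // Insert a space before every uppercase letter whose left neighbour is
    // any non-whitespace character. The first character never qualifies,
    // because nothing precedes it.
    static String expand(String name) {
        return name.replaceAll("(?<=\\S)(?=\\p{Lu})", " ");
    }
}
```

With this reading, MYPAGE expands to "M Y P A G E", IPPhone to "I P Phone" and mypagE to "mypag E", while a name that already has spaces, such as "My Page", is unchanged.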
On Wed, Nov 4, 2009 at 2:52 PM, Harry Metske <[email protected]> wrote:
Agreed on 1) and 2).
But how exactly do you define "adding a space before each uppercase
letter that starts a word"?
How do you find this "uppercase letter that starts a word" in a
pagename or link?
Can you give a few samples?
/Harry
2009/11/2 Andrew Jaquith <[email protected]>
Ok, that makes sense. I can think of cases in English too, like
"averse" (opposed to) and "a verse" (a portion of a song or poem). I
just decided that I didn't care. :)
But assuming we do care...
...what about going the other way: on import, or on page save, or page
lookup, forcibly expanding CamelCasePageNames (and inline page links)
so that they have one space in between the words? That way,
case-insensitive matching with spaces preserved (trimmed to one space)
would work.
So, the rules would be this:
(1) When links in pages are parsed, or page names are saved, leading
and trailing spaces will be trimmed, and all whitespace between words
will be replaced with one space character.
(2) Whitespace before and after the space name will be removed.
(3) CamelCase page links or page names will be normalized by adding a
space before each uppercase letter that starts a word.
(4) Tests for page name equality are done by applying rules (1), (2)
and (3) and making a case-insensitive comparison.
That seems simple enough, no?
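The four rules could be sketched roughly like this. Rule (3) does not pin down what "starts a word" means, so the sketch assumes the simplest reading (an uppercase letter preceded by any non-whitespace character); that reading, and every name below, is an assumption for illustration rather than an agreed design.

```java
import java.util.Locale;

// Hypothetical sketch of rules (1)-(4); not JSPWiki code.
final class PageNameEquality {

    // Rules (1) and (2): trim the ends, collapse inner whitespace to one space.
    static String normalizeWhitespace(String name) {
        return name.trim().replaceAll("\\s+", " ");
    }

    // Rule (3), ASSUMED reading: space before each uppercase letter that
    // follows a non-whitespace character.
    static String expandCamelCase(String name) {
        return name.replaceAll("(?<=\\S)(?=\\p{Lu})", " ");
    }

    // Canonical form used only for comparison, never for display.
    static String canonical(String name) {
        return expandCamelCase(normalizeWhitespace(name)).toLowerCase(Locale.ROOT);
    }

    // Rule (4): case-insensitive comparison of the normalized names.
    static boolean samePage(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```

Here "MainPage" and " main  page " compare equal, while "maan alle" and "maanalle" stay distinct because the space is preserved. Note that "mainpage" would not match "MainPage" under this sketch, since it has no capital to expand.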
Andrew
On Mon, Nov 2, 2009 at 2:44 PM, Janne Jalkanen <[email protected]> wrote:
Can you provide some examples where a
strip-the-whitespace-and-do-a-case-insensitive-comparison strategy
would not work, in Finnish? I'd like to understand this, seriously.
E.g. "maan alle" vs "maanalle". First means "into the ground", the
next one is "earth bear".
Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a fir" (or "of
fir timber").
Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who looks
(literally) from the sides" vs "onlooker". The difference is subtler
than with the previous ones, but the existence of the space is
significant information.
In fact, getting mixed up when two words go together and when they do
not is one of the most common grammatical errors. Sometimes the
results can be fairly hilarious and unintended. Often it looks just
sad.
But the point being that in Finnish (and other so-called agglutinative
languages), whitespace is significant. So it should not be ignored
arbitrarily.
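To make the difference concrete, here is a small sketch comparing the two candidate keys: one that strips all whitespace and one that only collapses it. Both helpers are hypothetical and exist purely to illustrate the argument.

```java
import java.util.Locale;

// Hypothetical comparison keys, for illustration only.
final class WhitespaceSignificance {

    // Strip-the-whitespace strategy: all spaces are thrown away.
    static String stripAllKey(String name) {
        return name.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Collapse-only strategy: runs of whitespace become one space.
    static String collapseKey(String name) {
        return name.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

Under stripAllKey, "maan alle" and "maanalle" produce the same key and would be treated as one page; under collapseKey they remain two different pages, while "maan   alle" and "maan alle" still match.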
Besides, I am not aware of any wiki engines that would consider
whitespace insignificant in determining pagename equality. MediaWiki's
rules concerning spaces are:
<snip>
Spaces/underscores which are ignored:
* those at the start and end of a full page name
* those at the end of a namespace prefix, before the colon
* those after the colon of the namespace prefix
* duplicate consecutive spaces
<snap>
FYI, I took a look at JSPWiki.org to see what the scale of the problem
might be. The site has about 4850 pages. I yanked down all of the page
names and compared them. I detected exactly ONE name clash: "Text
formatting rulesKorean" and "TextformattingrulesKorean" appear to be
different pages. That is a 0.02% collision rate -- and easily handled
by a rename-on-import or special-page redirection strategy.
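A check along these lines can be sketched in a few lines of Java. The message does not say how the comparison was actually done, so the key function (strip whitespace, lowercase) and all names here are assumptions; the input is assumed to be a list of distinct page names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical clash detector; not the script that was actually used.
final class NameClashReport {

    // ASSUMED comparison key: whitespace removed, case folded.
    static String key(String name) {
        return name.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Groups page names by key and keeps only the groups with 2+ members.
    static Map<String, List<String>> clashes(List<String> pageNames) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : pageNames) {
            byKey.computeIfAbsent(key(name), k -> new ArrayList<>()).add(name);
        }
        byKey.values().removeIf(group -> group.size() < 2);
        return byKey;
    }
}
```

Feeding it the full list of page names would return one entry per clash, each listing the page names that collapse to the same key.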
That's not what I meant. I meant that we have many links of the form
[word1 word2] embedded within running text. If we change those, then
the running text becomes meaningless and needs to be *checked by
hand*.
/Janne