Ehm. So "ThisIsAPage" => "This Is APage", "XRay" => "XRay" (and not "X Ray") and "PresidentUThant" => "President UThant"?

I fail to see where this becomes easier than the current method, which is fairly straightforward:

1) Clean the page name first by collapsing every whitespace sequence to a single space. Find a page by that name.

2) IFF the page in 1) did not exist, collapse the rest of the spaces away, then try finding that page. This keeps it compatible with the CamelCase syntax and the pre-2.6 method. We can make this an option in 3.0 and turn it off for new installs, so they will not create a legacy.
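
For illustration, the two-step lookup above could be sketched in Java roughly like this. It is only a sketch under assumptions: PageLookupSketch, findPage and the legacyFallback flag are hypothetical names, and a plain Set<String> stands in for the real page store.

```java
import java.util.Optional;
import java.util.Set;

final class PageLookupSketch {

    /** Step 1: trim the name and collapse each whitespace run to one space. */
    static String collapseWhitespace(String name) {
        return name.trim().replaceAll("\\s+", " ");
    }

    /** Step 2 (legacy fallback): remove the remaining whitespace entirely. */
    static String stripWhitespace(String name) {
        return name.replaceAll("\\s+", "");
    }

    /**
     * Looks up a page by the cleaned name first; only if that fails, and only
     * if the (hypothetical) legacyFallback option is on, tries the name with
     * all spaces removed.
     */
    static Optional<String> findPage(Set<String> existingPages,
                                     String requested,
                                     boolean legacyFallback) {
        String cleaned = collapseWhitespace(requested);
        if (existingPages.contains(cleaned)) {
            return Optional.of(cleaned);
        }
        if (legacyFallback) {
            String squashed = stripWhitespace(cleaned);
            if (existingPages.contains(squashed)) {
                return Optional.of(squashed);
            }
        }
        return Optional.empty();
    }
}
```

With the pages "Main Page" and "ThisIsAPage" in the set, a request for "  Main   Page " resolves in step 1, while "This Is A Page" resolves only through the step 2 fallback.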

beautifyTitle() kind of attempts to do exactly this, but it's not particularly efficient and is known to break in several cases.

I also think changing the page names upon import is a bad idea, since it will e.g. break certain types of search and possibly scripts which rely on pages named in a certain pattern. I'd say let's keep the page names as-is, since it's likely that the users have chosen the page names with a particular purpose in mind, and I'd rather not go around second-guessing them.

/Janne

On Nov 5, 2009, at 18:21 , Andrew Jaquith wrote:

Yes, that is right. While the examples do look a bit odd, I'd point out that MYPAGE in all-caps is an acronym.

There is one refinement we could make to rule (3): add a space only when an uppercase letter follows a lowercase letter. So "PageJSPWiki" would expand to "Page JSPWiki".

That is still fairly simple, while producing good results.
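
A minimal Java sketch of that refinement, assuming a regular expression is an acceptable way to find the lowercase-to-uppercase boundary. The class and method names are hypothetical.

```java
final class CamelCaseRefinementSketch {

    /**
     * Inserts one space at every position where an uppercase letter
     * directly follows a lowercase letter. Runs of uppercase letters
     * (acronyms) are left untouched.
     */
    static String expand(String name) {
        // Zero-width match: lowercase letter behind, uppercase letter ahead.
        return name.replaceAll("(?<=\\p{Ll})(?=\\p{Lu})", " ");
    }
}
```

Under this rule "PageJSPWiki" becomes "Page JSPWiki", "MYPAGE" and "IPPhone" stay unchanged, and "mypagE" still becomes "mypag E".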

On Nov 5, 2009, at 0:49, Harry Metske <[email protected]> wrote:

so that would mean for example :

[MYPAGE] => [ M Y P A G E ]
[IPPhone]   => [ I P Phone]
[mypagE] => [mypag E]

looks a bit odd to me


2009/11/5 Andrew Jaquith <[email protected]>

I'd define it as "an uppercase letter that follows a non-whitespace
character."
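
Read literally, that definition could be sketched in Java as below. The class and method names are hypothetical, and the regular expression is just one possible reading of "follows a non-whitespace character".

```java
final class UppercaseSplitSketch {

    /**
     * Inserts one space before every uppercase letter that is directly
     * preceded by a non-whitespace character. The first character of the
     * name never gets a space, because nothing precedes it.
     */
    static String expand(String name) {
        // Zero-width match: non-whitespace behind, uppercase letter ahead.
        return name.replaceAll("(?<=\\S)(?=\\p{Lu})", " ");
    }
}
```

Because every uppercase letter inside a word matches, an all-caps name such as "MYPAGE" is split into single letters ("M Y P A G E"), while "My Page" is left alone.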

On Wed, Nov 4, 2009 at 2:52 PM, Harry Metske <[email protected]>
wrote:
agreed on the 1) and 2)

But how exactly do you define "adding a space before each uppercase
letter that starts a word"?
How do you find this "uppercase letter that starts a word" in a
pagename or link?
Can you give a few samples?

/Harry

2009/11/2 Andrew Jaquith <[email protected]>

Ok, that makes sense. I can think of cases in English too, like
"averse" (opposed to) and "a verse" (a portion of a song or poem). I
just decided that I didn't care. :)

But assuming we do care...

...what about going the other way: on import, or on page save, or page lookup, forcibly expanding CamelCasePageNames (and inline page links)
so that they have one space in between the words? That way,
case-insensitive matching with spaces preserved (trimmed to one space)
would work.

So, the rules would be this:

(1) When links in pages are parsed, or page names are saved, leading and trailing spaces will be trimmed, and all whitespace between words
will be replaced with one space character.
(2) Whitespace before and after the space name will be removed.
(3) CamelCase page links or page names will be normalized by adding a
space before each uppercase letter that starts a word.
(4) Tests for page name equality are done by applying rules (1), (2)
and (3) and making a case-insensitive comparison.
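
A rough Java sketch of how rules (1) to (4) could combine into an equality test. The names are hypothetical, rule (2) is folded into the trimming step, and "uppercase letter that starts a word" is approximated here as an uppercase letter directly after a lowercase one, which is an assumption rather than part of the rules as written.

```java
final class PageNameEqualitySketch {

    /** Applies rules (1) to (3) to a page name or link target. */
    static String normalize(String name) {
        // Rules (1) and (2): trim, then collapse whitespace runs to one space.
        String collapsed = name.trim().replaceAll("\\s+", " ");
        // Rule (3), approximated: space at each lowercase-to-uppercase boundary.
        return collapsed.replaceAll("(?<=\\p{Ll})(?=\\p{Lu})", " ");
    }

    /** Rule (4): case-insensitive comparison of the normalized names. */
    static boolean samePage(String a, String b) {
        return normalize(a).equalsIgnoreCase(normalize(b));
    }
}
```

With this sketch "MainPage" and " main   page " compare equal, while "maan alle" and "maanalle" stay distinct because the single space is preserved. Note that "MainPage" and "mainpage" also stay distinct, since the all-lowercase form carries no word boundary to expand.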

That seems simple enough, no?

Andrew

On Mon, Nov 2, 2009 at 2:44 PM, Janne Jalkanen <[email protected] >
wrote:
Can you provide some examples where a
strip-the-whitespace-and-do-a-case-insensitive-comparison strategy would not work, in Finnish? I'd like to understand this, seriously.

E.g. "maan alle" vs "maanalle". First means "into the ground", the
next one is "earth bear".

Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a fir" (or "of
fir timber").

Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who looks (literally) from the sides" vs "onlooker". The difference is subtler
than with the previous ones, but the existence of the space is
significant information.

In fact, getting mixed up when two words go together and when they do
not is one of the most common grammatical errors.  Sometimes the
results can be fairly hilarious and unintended. Often it looks just
sad.

But the point being that in Finnish (and other so-called constructed languages), whitespace is significant. So it should not be ignored
arbitrarily.

Besides, I am not aware of any wiki engines that consider whitespace
insignificant in determining page name equality. MediaWiki's rules
concerning spaces are:

<snip>
Spaces/underscores which are ignored:
* those at the start and end of a full page name
* those at the end of a namespace prefix, before the colon
* those after the colon of the namespace prefix
* duplicate consecutive spaces
<snap>

FYI, I took a look at JSPWiki.org to see what the scale of the problem
might be. The site has about 4850 pages. I yanked down all of the page
names and compared them. I detected exactly ONE name clash: "Text
formatting rulesKorean" and "TextformattingrulesKorean" appear to be
different pages. That is a 0.02% collision rate -- and easily handled
by a rename-on-import or special-page redirection strategy.
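
A check of this kind could look roughly like the following Java sketch. It assumes the comparison was the strip-whitespace, case-insensitive one discussed earlier in the thread. The class and method names are hypothetical, and this is not the script that was actually run.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class NameClashSketch {

    /**
     * Groups page names by a key with all whitespace removed and letters
     * lowercased, and returns only the groups holding two or more names.
     */
    static Map<String, List<String>> findClashes(Collection<String> pageNames) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : pageNames) {
            String key = name.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Keep only keys shared by more than one page name.
        byKey.values().removeIf(group -> group.size() < 2);
        return byKey;
    }
}
```

Fed the two names above plus any unrelated page, this reports a single clash group under the key "textformattingruleskorean".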

That's not what I meant. I meant that we have many links of the form
[word1 word2] embedded within running text. If we change those, then
the running text becomes meaningless and needs to be *checked by hand*.

/Janne




