Re: 2 last failing unit tests

Janne Jalkanen Mon, 09 Nov 2009 14:17:19 -0800

Ehm. So "ThisIsAPage" => "This Is APage", "XRay" => "XRay" (and not "X Ray") and "PresidentUThant" => "President UThant"?

I fail to see where this becomes easier than the current method, which is fairly straightforward:

1) Clean the page name first by collapsing all spaces so that only one space remains per whitespace sequence. Find a page by that name.

2) IFF the page in 1) did not exist, collapse the rest of the spaces away, then try finding that page. This keeps it compatible with the CamelCase syntax and the pre-2.6 method. We can make this an option in 3.0, and for new installs, turn it off so they will not create a legacy.
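
The two steps above can be sketched roughly as follows. This is an illustration in Python rather than JSPWiki's actual code; the function name `find_page`, the `legacy_fallback` flag (standing in for the proposed 3.0 option) and the exact-match lookup against a set of names are all assumptions made for the example.

```python
import re

def find_page(name, existing_pages, legacy_fallback=True):
    """Hypothetical two-step page lookup; returns the matching name or None."""
    # Step 1: trim, then collapse every whitespace run to a single space.
    cleaned = re.sub(r"\s+", " ", name.strip())
    if cleaned in existing_pages:
        return cleaned
    # Step 2: only if step 1 found nothing, drop the remaining spaces so
    # that old CamelCase-style page names still resolve.
    if legacy_fallback:
        squashed = cleaned.replace(" ", "")
        if squashed in existing_pages:
            return squashed
    return None

pages = {"Main Page", "TextFormattingRules"}
print(find_page("Main   Page", pages))            # Main Page
print(find_page("Text Formatting Rules", pages))  # TextFormattingRules
print(find_page("Text Formatting Rules", pages, legacy_fallback=False))  # None
```

With the fallback switched off, a new install never resolves a spaced name to a squashed one, which is the "no legacy" behaviour described in step 2.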

The beautifyTitle() method more or less attempts to do exactly this, but it's not particularly efficient and is known to break in several cases.

I also think changing the page names upon import is a bad idea, since it will e.g. break certain types of search and possibly scripts which might rely on pages named in a certain pattern. I'd say let's keep the page names as-is, since it's likely that the users have chosen the page names with a particular purpose in mind, and I'd rather not go around second-guessing them.


/Janne

On Nov 5, 2009, at 18:21 , Andrew Jaquith wrote:

Yes, that is right. While the examples do look a bit odd, I'd point out that MYPAGE in all-caps is an acronym.
There is one refinement we could make to rule (3): add a space when an uppercase letter follows a lowercase letter. So PageJSPWiki would expand to Page JSPWiki.
That is still fairly simple, while producing good results.
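
A minimal sketch of this refinement, assuming ASCII letters only and using a hypothetical helper name `expand_camel_case`. It is an illustration in Python, not code from JSPWiki.

```python
import re

def expand_camel_case(name):
    """Insert a space wherever an uppercase letter follows a lowercase one."""
    # Zero-width match: lowercase letter behind, uppercase letter ahead.
    # ASCII-only character classes are an assumption of this sketch.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)

for name in ["PageJSPWiki", "ThisIsAPage", "XRay", "PresidentUThant", "MYPAGE"]:
    print(name, "=>", expand_camel_case(name))
# PageJSPWiki => Page JSPWiki
# ThisIsAPage => This Is APage
# XRay => XRay
# PresidentUThant => President UThant
# MYPAGE => MYPAGE
```

Runs of capitals are left alone, so acronyms stay together, but a one-letter word such as the "A" in ThisIsAPage sticks to the word that follows it.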

On Nov 5, 2009, at 0:49, Harry Metske <[email protected]> wrote:
So that would mean, for example:

[MYPAGE] => [ M Y P A G E ]
[IPPhone]   => [ I P Phone]
[mypagE] => [mypag E]

looks a bit odd to me


2009/11/5 Andrew Jaquith <[email protected]>
I'd define it as "an uppercase letter that follows a non-whitespace
character."
On Wed, Nov 4, 2009 at 2:52 PM, Harry Metske <[email protected]> wrote:
agreed on the 1) and 2)

But how exactly do you define "adding a space before each uppercase
letter that starts a word"?
How do you find this "uppercase letter that starts a word" in a page
name or link?
Can you give a few samples?

/Harry

2009/11/2 Andrew Jaquith <[email protected]>
Ok, that makes sense. I can think of cases in English too, like
"averse" (opposed to) and "a verse" (a portion of a song or poem). I
just decided that I didn't care. :)

But assuming we do care...
...what about going the other way: on import, or on page save, or page
lookup, forcibly expanding CamelCasePageNames (and inline page links)
so that they have one space in between the words? That way,
case-insensitive matching with spaces preserved (trimmed to one space)
would work.

So, the rules would be this:
(1) When links in pages are parsed, or page names are saved, leading
and trailing spaces will be trimmed, and all whitespace between words
will be replaced with one space character.
(2) Whitespace before and after the space name will be removed.
(3) CamelCase page links or page names will be normalized by adding a
space before each uppercase letter that starts a word.
(4) Tests for page name equality are done by applying rules (1), (2)
and (3) and making a case-insensitive comparison.
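
As a rough illustration of how rules (1), (3) and (4) fit together, here is a Python sketch. The names `normalize` and `same_page` are made up for the example. Rule (3) is read here as "an uppercase letter that follows a non-whitespace character", which is only one possible interpretation, and rule (2) is left out because the message does not say how the space name is delimited.

```python
import re

def normalize(name):
    """Apply rule (1), then one reading of rule (3), to a page name or link."""
    # Rule (1): trim, then replace each whitespace run with one space.
    s = re.sub(r"\s+", " ", name.strip())
    # Rule (3), as assumed here: add a space before every ASCII uppercase
    # letter that directly follows a non-whitespace character.
    return re.sub(r"(?<=\S)(?=[A-Z])", " ", s)

def same_page(a, b):
    """Rule (4): compare the normalized names case-insensitively."""
    return normalize(a).casefold() == normalize(b).casefold()

print(normalize("  Main   Page "))       # Main Page
print(normalize("MYPAGE"))               # M Y P A G E
print(same_page("MainPage", "main page"))  # True
print(same_page("mainpage", "main page"))  # False
```

Under this reading an all-caps name is split into single letters, and a name with no capitals never matches its spaced form.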

That seems simple enough, no?

Andrew
On Mon, Nov 2, 2009 at 2:44 PM, Janne Jalkanen <[email protected]> wrote:
Can you provide some examples where a
strip-the-whitespace-and-do-a-case-insensitive-comparison strategy
would not work, in Finnish? I'd like to understand this, seriously.
E.g. "maan alle" vs "maanalle". First means "into the ground", the
next one is "earth bear".
Or "kuusi puuta" vs "kuusipuuta" - "six trees" vs "at a fir" (or "of
fir timber").
Or simply "sivusta katsoja" vs "sivustakatsoja" - "a person who looks
(literally) from the sides" vs "onlooker". The difference is subtler
than with the previous ones, but the existence of the space is
significant information.
In fact, getting mixed up when two words go together and when they do
not is one of the most common grammatical errors.  Sometimes the
results can be fairly hilarious and unintended. Often it looks just
sad.
But the point being that in Finnish (and other so-called constructed
languages), whitespace is significant. So it should not be ignored
arbitrarily.

Besides, I am not aware of any wiki engines which would consider
whitespace insignificant in determining page name equality.
MediaWiki's rules concerning spaces are:

<snip>
Spaces/underscores which are ignored:
* those at the start and end of a full page name
* those at the end of a namespace prefix, before the colon
* those after the colon of the namespace prefix
* duplicate consecutive spaces
<snap>
FYI, I took a look at JSPWiki.org to see what the scale of the problem
might be. The site has about 4850 pages. I yanked down all of the page
names and compared them. I detected exactly ONE name clash:
"Textformatting rulesKorean" and "TextformattingrulesKorean" appear to
be different pages. That is a 0.02% collision rate -- and easily handled
by a rename-on-import or special-page redirection strategy.
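
A check of this kind could look something like the Python sketch below. It is not the script that was actually run; the function name `find_clashes` and the comparison key (all whitespace removed, case folded) are assumptions based on the strip-the-whitespace, case-insensitive strategy discussed in this thread.

```python
from collections import defaultdict

def find_clashes(page_names):
    """Group page names that become identical once whitespace and case are ignored."""
    groups = defaultdict(list)
    for name in page_names:
        # Comparison key: drop all whitespace, then fold case.
        key = "".join(name.split()).casefold()
        groups[key].append(name)
    # Only keys shared by two or more distinct names are clashes.
    return [names for names in groups.values() if len(names) > 1]

names = ["Main", "Textformatting rulesKorean", "TextformattingrulesKorean"]
print(find_clashes(names))
# [['Textformatting rulesKorean', 'TextformattingrulesKorean']]
```

One clashing pair among roughly 4850 names works out to about 0.02%, matching the figure quoted above.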
That's not what I meant. I meant that we have many links of the form
[word1 word2] embedded within running text. If we change those, then
the running text becomes meaningless and needs to be *checked by
hand*.

/Janne
