Re: WikiName normalization

Janne Jalkanen Sun, 28 Dec 2008 02:55:03 -0800

So does 35 failures and 7 errors sound right for the 2.8 branch? I don't
have RCS set up.

On my computer, all of them run. So you probably still have some problems. Not having installed RCS would cause those tests to fail, but certainly not others.

That's fine, but it makes no attempt to help with "Test some name",
"Test some Name" and "Test SomeName" being treated as different pages.

Here's a problem: it *does* happen on some platforms. The page names are partly case-sensitive on a platform-dependent basis, and that can't be helped outside of ripping out the entire backend and replacing it with something more sane.

Um, how does it lose information?  It *adds* spaces (fairly nicely
too[1]). What information does it *lose*? (Maybe I'm being dense, can you
give me an example?)
[1] The only corner case I've ever noticed that bothered me is "PDFs Are
Nice", which turns out as "PD Fs Are Nice".

That would be a good example of where it loses information - PDF is a single word, and it arbitrarily removes that information.
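
To make the corner case concrete, here is a minimal sketch. It is not JSPWiki's actual beautifyString(); the class and method names are hypothetical, and the splitting rule (insert a space before any upper-case letter that is followed by a lower-case one) is only an assumption that happens to reproduce the result quoted above.

```java
final class NaiveBeautifier {
    private NaiveBeautifier() {}

    /** Strips spaces, as happens when a free link is mapped to a CamelCase name. */
    static String toCamelCaseName(String title) {
        return title.replace(" ", "");
    }

    /**
     * Hypothetical splitting rule: put a space before every upper-case
     * letter that is directly followed by a lower-case letter, except
     * at the very start of the name.
     */
    static String beautify(String name) {
        return name.replaceAll("(?<=\\S)(?=\\p{Lu}\\p{Ll})", " ");
    }
}
```

Under this rule, "PDFs Are Nice" is stored as "PDFsAreNice", and beautifying that gives "PD Fs Are Nice": once the spaces are stripped, nothing records that "PDFs" was one word.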

The allowed punctuation chars " ()&+,-=._$" greatly raise the
complexity (and flexibility) of WikiNames.  Again, "Test(Name)" and
"Test (Name)" are two different pages, as are "Test 2+2=4" and
"Test2 + 2 = 4".  These punctuation chars could have rules for
normalization expressed easily for English, but I'm completely unsure
how those rules would work for other languages (the decimal separator
rule would at least need to be platform based).

Which is exactly why I think the only sane normalization would be to compress spaces in the sense of


1..N spaces => 1 space
0 spaces = 0 spaces

with no other normalizations. If we can figure out a way to get rid of any other normalizations, that would be great (but I don't think that's really possible). This includes English plurals, beautifyString(), etc.
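
The space-compression rule can be sketched in a few lines of Java. The class and method names are made up for illustration and are not part of JSPWiki; the sketch also assumes that only the plain ASCII space counts, and it deliberately leaves case, punctuation, and leading or trailing spaces alone.

```java
final class SpaceCompression {
    private SpaceCompression() {}

    /**
     * Collapses every run of two or more spaces into a single space.
     * Names without spaces pass through unchanged, so no space is ever
     * introduced where there was none.
     */
    static String compressSpaces(String pageName) {
        return pageName.replaceAll(" {2,}", " ");
    }
}
```

With this rule "Test   Some  Name" and "Test Some Name" become the same name, while "Test(Name)" and "Test (Name)" stay distinct.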

Note that Wikipedia works well with punctuation characters. They are just titles.

In wikis the link should always be equal to the title of the page. The reason why we're having this discussion is the unfortunate decision I made a long time ago to allow freelinks to map to CamelCase names. If that had not been done, there would be no problems whatsoever.

The 2.8 branch's JSPWikiMarkupParserTest has 8 failures as it is in
svn; they appear to be "%20" related in some checked URLs.  I assume
these were known and accepted?


Nope.  They all run 100% for me. Otherwise we wouldn't have released ;-)

I think before you hack anything, you should probably check what is going on...

JSPWiki 2.8 and all earlier versions:
1) are Case sensitive when it comes to wiki page names. ("Test name"
isn't same page as "Test Name")


They are partly case sensitive.

2) Allow spaces in name to differentiate pages ("Test SomeName" isn't
same page as "Test Some Name".)

Yes, but on some platforms "Test Somename" and "Test SomeName" are equal.

It is a good question which behaviour we should standardize on. I think I prefer the case insensitivity.

I chimed in on this normalization stuff because you mentioned creating a
WikiName class or some such a while back.  Looking thru the codebase
yesterday and today shows a zillion places where the paradigm "String
pageName" is used.  The testcases especially have hardcoded page names
in them and the tests in many cases dip under the covers for setup &
scaffolding work...  Ick, but fixable.

There's actually a good reason for a lot of that stuff; it's fast to write the tests, and also, they try to isolate the components so that any failures in other components would not affect the current test.

There is one area that is hard to unit test and that deals with handling
"legacy" pages in the providers repository.  For instance, this work
shows that a user can have multiple pages on disk for the *same*
normalized wiki name.


Correct.  Which is a problem.
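
One way to see how bad the legacy situation is in a given repository would be to group the existing page names by whatever normalization ends up being chosen and list the groups with more than one member. The sketch below is only that: the class, the method, and the idea of passing the normalizer in as a function are assumptions for illustration, not existing JSPWiki code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

final class LegacyNameCollisions {
    private LegacyNameCollisions() {}

    /**
     * Groups page names by their normalized form and returns only the
     * groups in which two or more stored names collapse to the same key.
     */
    static Map<String, List<String>> findCollisions(List<String> pageNames,
                                                    UnaryOperator<String> normalizer) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : pageNames) {
            byKey.computeIfAbsent(normalizer.apply(name), k -> new ArrayList<>()).add(name);
        }
        // Drop keys that map to a single page; those are not collisions.
        byKey.values().removeIf(group -> group.size() < 2);
        return byKey;
    }
}
```

Called with a case-folding normalizer on "Test name", "Test Name" and "Main", this reports one colliding group containing the first two names.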

How should this be handled on a moving forward basis? I think it *has*
to be handled, because I think case sensitive wikinames are too
confusing to casual users. I think AbstractFileProvider.findPage() is a
place where this could be handled.  But I am unwilling to proceed
further without input from the dev team.

I personally think that case insensitivity is the way to go. Unfortunately, that means that title beautification has to go, simply because it would mean that "Test Somename", "Test SomeName", and "Test Some Name" would need to be equal, BUT it also means that "Testsomename", "test So Me Nam E", "Test Som Ename", "Test S Omename" and all the other possible variants would need to be considered equal too. This is just too much variance, IMHO.
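
The amount of variance is easy to demonstrate. The sketch below assumes a comparison key that ignores both case and spaces; the class and method names are hypothetical and only illustrate what "case insensitive plus beautification" would imply.

```java
import java.util.Locale;

final class LooseNameKey {
    private LooseNameKey() {}

    /** Comparison key that ignores both spaces and letter case. */
    static String looseKey(String pageName) {
        return pageName.replace(" ", "").toLowerCase(Locale.ROOT);
    }
}
```

Every variant listed above, from "Test Some Name" to "test So Me Nam E", produces the key "testsomename", so all of them would have to resolve to the same page.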

!!!Proposal:
JSPWiki user-visible page names should be clean & normal __and__ allow
spaces in them.

JSPWiki internal page names should be clean & normal and __not__ have
spaces in them.

I simply don't think this is possible due to the above limitation. It means that all pagenames should be stored in lowercase, space-compressed in the repository (i.e. "testsomename"), since JCR is case sensitive. Which means that beautifyString() cannot have any capital letters to work with, unless we start storing the page title outside of its WikiName, which is of course possible, but kinda against Wiki ideals.

BTW, this would then also have to be true for attachments as well, since in 3.0 they are treated exactly like pages.

Is the above proposal tracking toward what you wanted?  Or do you want
something more prose-like?  Basically this would be putting
beautifyString() on steroids. Oddly though, it gets used to break apart
names and add capitalization, but then the spaces get stripped right
back out.

beautifyString() is a problem for us Finns, since it guesses the proper capitalization wrong all the time. In Finnish, headlines don't have Every First Letter In Capital, but we would write "Every first letter in capital".

I think that it might be better to stop guessing what the user wants, and just be as simple as possible. Get rid of our overly complicated normalization, and just keep links from breaking.


/Janne
