RE: WikiName normalization

Volkar, John M. Tue, 23 Dec 2008 10:52:49 -0800

Janne,

> They *do* run.  Did you run "ant tests" or "ant guitests"?


Huh, neither, just straight in Eclipse.  Got it now. Thanks.

 

> Certainly too many errors.  Try running "ant tests".

Okay so 2.8 (unmodified):
* "ant tests"     Tests: 952, Failures: 35 , Errors: 7

So does 35 fails and 7 errors sound right for the 2.8 branch? I don't
have RCS setup.



> The thinking is that page names with spaces are the norm - and, if the

> page does not exist, JSPWiki checks for "old style" pagenames (pre
2.6).

That's fine, but it makes no attempt to help with "Test some name",
"Test some Name" and "Test SomeName" being treated as different pages.  


>> TextUtil.beautifyString could be employed in .cleanLink to make it 
>> commutative...

> Probably not a good idea, since beautifyString() loses information.

Um, how does it lose information?  It *adds* spaces (fairly nicely
too[1]). What information does it *lose* (Maybe I'm being dense, can you
give me an example?)

[1] The only corner case I've ever noticed that bothered me is "PDFs Are
Nice"  which turns out as "PD Fs Are Nice".

Please provide an example of a case where it loses information.




> I would like to see a proposal written out

Okay.


> don't get any surprises (e.g. unreachable pages) - remember that
> 2.6 has been out for quite a while, so there are quite a few sites 
> already out there for whom we no longer can break the links.

That is a given, it should "just work" from the users perspective no
matter what kind of names they got used to using or how exactly things
are named in their page provider repository.  

The allowed punctuation chars " ()&+,-=._$" greatly raises the
complexity (and flexibility) of WikiNames.  Again, "Test(Name)" and
"Test (Name)" are two different pages as is "Test 2+2=4" and "Test 2 + 2
= 4".  These punctuation chars could have rules for normalization
expressed easily for English, but I'm completely unsure how those rule
would work for other languages (the decimal separator rule would at
least need to be platform based).




The normalization as I've hacked cleanLink "feels like" it is moving in
the right direction.  The failures and errors that this modification
causes need fixed of course.  

The 2.8 branch's JSPWikiMarkupParserTest has (8) failures as it is in
svn, they appear to be "%20" related in some checked URLs.  I assume
these were known and accepted?


<thinking/>

I've noted that it is the presence of spaces in the cleanedLink that
seem to cause most of the test failures for me, what if they were simply
filtered out by cleanLink?  Meaning "clean" links will have NO spaces in
them.  Users can still of course reference everything with spaces, but
internally link names are spaceless.

Let me try this... <working/>
 

Okay so with:
1) the tweaked, spaceless cleanLink and 
2) a call to it in the WikiPage constructor to clean the name
3) a tweak to Attachment to deal with cleaning parentPage name.
4) adding a few calls to cleanLink in:
4a) PageManager where Strings rather than WikiPage objects are used.
4b) WeblogPlugin (of all places!)

The "ant tests" result in Tests:952, Failures:36, Errors:8 

This is much better (was 35 and 7 for me).  

Investigating these few remaining shows that most of the failures and
errors are due to:
1) Assumptions in the test cases about what url string should be
returned (%20 stuff), and
2) Case changes in the normalized page name, e.g. "Foobar2s" is cleaned
to "Foobar2S" 

If test cases were fixed up for case & space issues only I think this
would be just fine.

Funny, this now gives %20less URLs without any other changes (I think?).


Point summary:
JSPWiki 2.8 and all earlier versions:
1) are Case sensitive when it comes to wiki page names. ("Test name"
isn't same page as "Test Name")
2) Allow spaces in name to differentiate pages ("Test SomeName" isn't
same page as "Test Some Name".)
3) But 1 and 2 depend on the order of page creation (if CamelCase is
made first it's okay?).

I don't like 1 and 2, and 3 makes it goofy to figure out the rules.  

I chimed in on this normalization stuff because you mentioned creating a
WikiName class or some such a while back.  Looking thru the codebase
yesterday and today shows a zillion places where the paradigm "String
pageName" is used.  The testcases especially have hardcoded page names
in them and the tests in many cases dip under the covers for setup &
scaffolding work...  Ick, but fixable. 


I currently stand at:
* "ant tests"     Tests:952, Failures:36, Errors:8   was (35 & 7)



There is one area that is hard to unit test and that deals with handling
"legacy" pages in the providers repository.  For instance, this work
shows that a user can have multiple pages on disk for the *same*
normalized wiki name.

How should this be handled on a moving forward basis?  I think it *has*
to be handled, because I think case sensitive wikinames are too
confusing to casual users.  I think AbstractFileProvider.findPage() is a
place where this could be handled.  But I am unwilling to proceed
further without input from the dev team.

So, now what?  A test case and a proposal follow.


Regards,
John Volkar
----
Add this to JSPWikiMarkupParser...  It passes with the new cleanLink
that I'm working with.

    /**
     * This FAILS with 2.8, since it treats "Test name" as different
from "TestName". 
     * <p> 
     * NOTE: If "TestName" is created first, this will all pass.
     */
    public void testLinkCleaning() throws Exception
    {
       newPage( "Test name" ); // passes if "TestName".       
       String expectHead = "This is a <a class=\"wikipage\"
href=\"/Wiki.jsp?page=TestName\">";
       String expectTail ="</a> to try.";
       String src = null;
       
       src = "This is a TestName to try.";
       assertEquals(expectHead + "TestName" + expectTail,
translate(src));
       
       src = "This is a [Test Name] to try.";
       assertEquals(expectHead + "Test Name" + expectTail,
translate(src));
       
       src = "This is a [Test name] to try.";
       assertEquals(expectHead + "Test name" + expectTail,
translate(src));
       
       src = "This is a [test Name] to try.";
       assertEquals(expectHead + "test Name" + expectTail,
translate(src));
       
       src = "This is a [test name] to try.";
       assertEquals(expectHead + "test name" + expectTail,
translate(src));
       
       src = "This is a [silly test|TestName] to try.";
       assertEquals(expectHead + "silly test" + expectTail,
translate(src));
       
       src = "This is a [silly test|Test Name] to try.";
       assertEquals(expectHead + "silly test" + expectTail,
translate(src));
       
       src = "This is a [silly test|Test name] to try.";
       assertEquals(expectHead + "silly test" + expectTail,
translate(src));
       
       src = "This is a [silly test|test Name] to try.";
       assertEquals(expectHead + "silly test" + expectTail,
translate(src));
       
       src = "This is a [silly test|test name] to try.";
       assertEquals(expectHead + "silly test" + expectTail,
translate(src));
   }


!!!Proposal:
JSPWiki user-visible page names should be clean & normal __and__ allow
spaces in them.

JSPWiki internal page names should be clean & normal and __not__ have
spaces in them.

Where "clean" means that only letters, digits and a limited set of
punctuation characters are allowed to be present in the name.

Where "letters" and "digits" are those indicated as such by the java
Character class.

Where "limited set of punctuation" are only those given by
MarkupParser.PUNCTUATION_CHARS_ALLOWED but __excluding__ spaces.

<fuzzy>
Where "normal" means that all words in a name are in a __consistently
applied__ "Title Case" word form.

Where "words" are the resulting space separated elements returned by
beautifyString.

NOTE: beautifyString could be enhanced for the new punctuation
characters, it's currently "dumb" to them, maybe rules like...
* '(' always preceded by a space, never follow by a space
* ')' always follow by a space, never preceded by a space
* ',.' always follow by either space, digit or ')', never preceded by
space
* '$' always preceded by a space
* '_' never preceded or followed by a space.
* '&+-=' always precede and follow by spaces.
</fuzzy>
----


Is the above proposal tracking toward what you wanted?  Or do you want
something more prose-like?  Basically this would be putting
beautifyString() on steroids.  Oddly though, it gets used to break apart
names and add capitalization, but then the spaces get stripped right
back out.  


Apologies for the *long* email.

RE: WikiName normalization

Reply via email to