setLoadStripWhitespace() api errors when trimming white space characters
------------------------------------------------------------------------
Key: XMLBEANS-295
URL: http://issues.apache.org/jira/browse/XMLBEANS-295
Project: XMLBeans
Issue Type: Bug
Components: Validator
Affects Versions: Version 2.2.1
Environment: SunOS 5.9 and Microsoft Windows XP SP2, Java 1.4.2
Reporter: David RR Webber
Fix For: TBD
Situation Summary
We went into production using the setLoadStripWhitespace() api in
XMLBeans. After some days we started getting intermittent failures from
occasional XML transactions.
After a week of investigation we realized that the flushText() method itself was
the cause - having eliminated all other factors. Specifically we have
determined that character strings containing the & character result in spaces
being stripped immediately after the & - e.g. <company>B & H Photo</company>
becomes <company>B &H Photo</company>.
We realize that there is a patch available for & processing - and we are
currently testing that to see if it cures the problem relating to &
(http://issues.apache.org/jira/browse/XMLBEANS-274).
However we are also seeing an intermittent problem in our UNIX environment
associated with the colon character : (other characters could be affected as
well - we do not have a definitive list). What we found is spaces being trimmed
intermittently in various fields that do not contain "&" (the original
XMLBEANS-274 bug reported). This one we cannot reproduce on our Windows
development systems - but it is happening intermittently on SunOS.
Again a space either immediately following the colon or later in the string is
stripped - for tokenized elements - e.g. <urgent>Yes: Y</urgent> becomes
<urgent>Yes:Y</urgent> and then the object returns a NULL value because this is
no longer a valid allowed value for the tokenized list. Similarly <location>USA:
United States</location> became <location>USA: UnitedStates</location>. We
suspect that a character before the colon might be triggering this behaviour
but we have not yet determined when or how. This illustrates how complex this
issue is in terms of the current XMLBeans implementation approach.
Analysis
We have looked at how and where XMLBeans is doing the white space trim during
the unmarshalling of the XML content. When it detects a white space - it then
invokes a stripRight() method loop. We are not convinced that this is
architecturally sound at the point it is employed - it is leading to complexity
and obviously a lot of edge conditions and some combinations of characters that
are not handled consistently and correctly.
Our preferred approach would be to defer the white space trim until
post-unmarshalling - so the initial process can treat the XML content "as is"
between the angle brackets - and the trim() is applied only once the content
has been extracted. At that point a simple Java String trim() can be employed. This could be
provided as an alternate method call to the current setLoadStripWhitespace()
api that would iterate through the entire structure of objects instead of the
original XML stream. The only check that would be necessary is if the XML
markup itself set the xml:space="preserve" attribute option for an element
object - in which case the trim() would be automatically skipped for that
content object item. What is happening right now is that the existing
flushText() method is mixing up XML markup and the content - instead there
needs to be a clear separation between the element angle brackets and attribute
quotes - and the content itself.
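To make the proposal concrete, here is a minimal sketch of a deferred trim. It does not use XMLBeans internals at all: it assumes the plain JDK DOM parser as a stand-in for the unmarshalling step, and the class and method names (DeferredTrim, parseAndTrim, trimContent) are hypothetical. It also relies on getTextContent(), which needs Java 5 or later, so it would not run as-is on the 1.4.2 environment listed above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DeferredTrim {

    /** Parses the XML untouched, then trims element content as a separate pass. */
    static Document parseAndTrim(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimContent(doc.getDocumentElement(), false);
            return doc;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /**
     * Trims leading and trailing white space of leaf element content.
     * Elements under xml:space="preserve" are skipped; xml:space="default"
     * switches trimming back on for a subtree.
     */
    static void trimContent(Element element, boolean inheritedPreserve) {
        boolean preserve = inheritedPreserve;
        String space = element.getAttribute("xml:space"); // "" when absent
        if ("preserve".equals(space)) {
            preserve = true;
        } else if ("default".equals(space)) {
            preserve = false;
        }

        boolean hasChildElements = false;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElements = true;
                trimContent((Element) child, preserve);
            }
        }

        // Only leaf elements are touched, and only with a plain String.trim().
        // Note: setTextContent also drops comments inside the leaf element.
        if (!hasChildElements && !preserve) {
            element.setTextContent(element.getTextContent().trim());
        }
    }
}
```

Because the trim runs on already-decoded strings, content such as "B & H Photo" or "USA: United States" keeps its interior spaces; only the ends of each value are affected.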
One caveat here - maybe the current approach is intended to run prior to error
checking on tokenized lists - to prevent failures there due to extra spaces?
However - even so it is not cleanly enough separated - and again it would
clearly be simpler to use the Java String class trim() method within the
tokenized evaluation itself on just the string.
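As a rough illustration of trimming inside the tokenized evaluation, the sketch below checks a raw value against an allowed list after calling trim() on just that string. The class name TokenCheck and the allowed values are hypothetical ("Yes: Y" comes from the example above, "No: N" is invented for the sketch), and this is not how XMLBeans currently validates enumerations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenCheck {

    // Hypothetical allowed values for the <urgent> element.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("Yes: Y", "No: N"));

    /**
     * Returns the matching allowed value, or null when the trimmed input
     * is not in the allowed list.
     */
    static String evaluate(String raw) {
        if (raw == null) {
            return null;
        }
        // Trim only the ends of the content string; interior spaces are kept.
        String candidate = raw.trim();
        return ALLOWED.contains(candidate) ? candidate : null;
    }
}
```

With this shape, evaluate("  Yes: Y \n") returns "Yes: Y", while the damaged value "Yes:Y" is still rejected. String.trim() handles only leading and trailing white space; the XML Schema token type also collapses interior runs of white space, which this sketch does not attempt.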
Suggested Solution
Re-factor the current white space setLoadStripWhitespace() api to delay string
manipulation on content until after unpacking of the content and XML markup -
instead of prior-to as is currently happening. This makes for much simpler
white space trim logic (can simply use the Java string class method) that does
not need to look for markup artifacts as well.
We are not clear on who owns this particular feature in XMLBeans - whether they
are currently available to assist on this - but we would be prepared to work
with the team to develop a better solution here.