[dom4j-dev] [ dom4j-Bugs-1116471 ] Problem with XPath and retrieving text

SourceForge.net Sat, 30 Dec 2006 03:49:21 -0800

Bugs item #1116471, was opened at 2005-02-04 13:06
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=116035&aid=1116471&group_id=16035


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Steve Carter (cart33)
Assigned to: Maarten Coene (maartenc)
Summary: Problem with XPath and retrieving text

Initial Comment:
I have a Junit test similar to the following:

public void test() {

      final String XML = "<a><b>Water T &amp; D-46816</b></a>";
      final String XPATH = "a/b/text()";
      final String EXPECTED_VALUE = "Water T & D-46816";

      XPath xpathObj = createXpathObject(XPATH);
      Document doc = createDocument(XML);
      Object node = xpathObj.selectSingleNode(doc);

      // result was undeclared in the original snippet
      String result = null;
      if (node instanceof Text) {
          result = ((Text) node).getText();
      }

      assertEquals(EXPECTED_VALUE, result);
}

which fails because getText() only returns: Water T

Interrogating the node object returned from
selectSingleNode indicates that the expected result is
present as 3 separate text nodes in the content
(ArrayList) member variable.

I can retrieve the value if I tweak the approach to use:
 
    final String XPATH = "a/b";

     if (node instanceof Element) {
            return  (String) ((Element) node).getData();
        }

If I don't have entity references then the first
approach always works. Therefore this seems to be a
bug; please correct me if I am wrong.


----------------------------------------------------------------------

Comment By: Lukas Theussl (lukas_theussl)
Date: 2006-04-03 16:55

Message:
Logged In: YES 
user_id=1301221

Hi Maarten,

I have just re-built Maven-1.1 using dom4j from the
DOM4J_1_X_BRANCH and together with jaxen-1.1-beta-8, it
seems to solve the problems that I reported at
http://jira.codehaus.org/browse/JAXEN-67 !

This is great news for us, as upgrading dom4j and jaxen has
been a long-standing blocker in Maven (see
http://jira.codehaus.org/browse/MAVEN-1345). I still have to
do some more thorough testing, but is there any chance that
we could have a stable release soon with this fix included?

Thanks!
-Lukas

----------------------------------------------------------------------

Comment By: Maarten Coene (maartenc)
Date: 2006-03-24 14:05

Message:
Logged In: YES 
user_id=178745

Bazza,

I've modified SAXContentHandler to also merge the CDATA
sections if you set mergeAdjacentText to true.

Could you please try again with the version from CVS?
(branch DOM4J_1_X_BRANCH)

thanks
Maarten

----------------------------------------------------------------------

Comment By: Victor (kromo)
Date: 2006-02-21 16:55

Message:
Logged In: YES 
user_id=1156663

I encountered the same problem. In my case there were no
entities, but a buffer boundary created the mismatch.
I used something like //serialNumber/text() to collect all
serial numbers, but one of these was split into two
separate text() nodes.
It makes a big difference, e.g. if one has something like
"count(//serialNumber)" or "count(//serialNumber/text())",
because these two numbers may not be equal even if
<serialNumber> contains only PCDATA.
Buffer boundaries should have no influence on the model.
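
To make the buffer-boundary point concrete, here is a minimal sketch that
uses only the JDK's SAX parser rather than dom4j. The class name
SaxChunkDemo and its methods are hypothetical, introduced only for
illustration. It records each characters() callback as delivered; how many
chunks arrive depends on the parser, and SAX permits a parser to split
contiguous text, so a tree builder that wants one text node per run has to
join them itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (not part of dom4j): records every characters()
// callback the JDK's SAX parser makes while reading a document.
class SaxChunkDemo {

    /** Returns the raw chunks exactly as the parser delivered them. */
    static List<String> chunks(String xml) throws Exception {
        final List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return chunks;
    }

    /** Joins the chunks, which is what a merging tree builder would store. */
    static String merged(String xml) throws Exception {
        return String.join("", chunks(xml));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><b>Water T &amp; D-46816</b></a>";
        // The chunk count is parser-dependent; the merged text is not.
        System.out.println(chunks(xml).size() + " chunk(s): " + chunks(xml));
        System.out.println("merged: " + merged(xml));
    }
}
```

The merged string is the same however the parser happens to chunk the
input, which is the property the XPath data model expects of text nodes.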

----------------------------------------------------------------------

Comment By: Bazza (bazzargh)
Date: 2005-12-22 04:05

Message:
Logged In: YES 
user_id=1005507

(came here from a related bug report filed against jaxen,
see http://jira.codehaus.org/browse/JAXEN-67 )

Maarten, I think there's a legitimate bug here:
/any/xpath/text() should only return multiple nodes for
mixed content, not just when there are entities present. eg:

<a>this<b>has</b>two</a>
Should return 2 for count(/a/text()); and with mixed content
the stringValue of '/a' is not the same as '/a/text()'
(referring to your workaround above)

<a>this hasn&apos;t one</a>
should return 1 for the same expression (going by the xpath
spec). 

Also:
<a>this <![CDATA[has]]> one</a>

Should return 1. People using xpath with dom4j need to use
normalize() to work around this whenever node() or text()
appear in their expressions. Unfortunately the
'setMergeAdjacentText' method at parse time, which would
appear to 'pre-normalize' the tree, doesn't. In
SAXContentHandler (copying and pasting from my comments on
JAXEN-67 ):

inside 'characters()', this code:
} else if (insideCDATASection) {
    if (mergeAdjacentText && textInTextBuffer) {
        completeCurrentTextNode();
    }

    cdataText.append(new String(ch, start, end));
} else {

... means that even if you've asked it to merge adjacent
text nodes, it goes ahead and builds cdata nodes; which it
then adds without checking the 'mergeAdjacentText' flag:

public void endCDATA() throws SAXException {
    insideCDATASection = false;
    currentElement.addCDATA(cdataText.toString());
}

To my mind, these should read, respectively:
} else if (insideCDATASection && !mergeAdjacentText) {
    cdataText.append(new String(ch, start, end));
} else {
...
public void endCDATA() throws SAXException {
    // you'd want this condition around the code in startCDATA too.
    if (!mergeAdjacentText) {
        insideCDATASection = false;
        currentElement.addCDATA(cdataText.toString());
    }
}

This would make 'mergeAdjacentText' normalize as it goes,
which I'm guessing was the desired behaviour?
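
For comparison only, the JDK's own DOM parser has a switch with the
behaviour described above: DocumentBuilderFactory.setCoalescing(true)
converts CDATA sections to text and appends them to the adjacent text
node. The sketch below is not dom4j code, and the class name CoalesceDemo
is hypothetical; it just shows the child count of <a> with and without
that switch, as an analogue of what 'mergeAdjacentText' might be expected
to do.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class using the JDK DOM API (not dom4j).
class CoalesceDemo {

    /** Parses xml and returns the root element. */
    static Element parseRoot(String xml, boolean coalesce) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // true: CDATA sections become text and are merged with neighbours
        factory.setCoalescing(coalesce);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    /** Number of direct child nodes of the root element. */
    static int childCount(String xml, boolean coalesce) throws Exception {
        return parseRoot(xml, coalesce).getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>this <![CDATA[has]]> one</a>";
        System.out.println("children without coalescing: " + childCount(xml, false));
        System.out.println("children with coalescing:    " + childCount(xml, true));
        System.out.println("text: " + parseRoot(xml, true).getTextContent());
    }
}
```

Either way the string-value of <a> is identical; only the number of nodes
that a text() step can see changes.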

----------------------------------------------------------------------

Comment By: Michael Pichler (mpichler)
Date: 2005-12-16 05:31

Message:
Logged In: YES 
user_id=613551

I stand corrected. The spec says that adjacent Text nodes
should be merged automatically. Thus the normalize() call
is a workaround (but at least, it should work).


----------------------------------------------------------------------

Comment By: Michael Pichler (mpichler)
Date: 2005-12-16 05:25

Message:
Logged In: YES 
user_id=613551

Hi,

I think this is perfectly normal. There are multiple
text() children which may be addressed separately with
xpaths containing indices (see bug 1374352).

Your problem is that selectSingleNode() only selects the
first matching text child, and it seems you should call
normalize() on the root element first to "merge" adjacent
Text nodes before any further processing.

regards,
Michael Pichler


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2005-04-13 19:14

Message:
Logged In: NO 

I have just hit this bug (in production of course ;-).
Originally using dom4j 1.4, but still present using dom4j
1.5.2, jaxen 1.0FCS.

My xpath is of the form "//a/b[text()="value"]/..".
This failed in one case because the 'b' element has been
parsed into two 'text' nodes. It seems it crossed some
buffer boundary in the parsing stage, as the two text values
are "TINBICS_SECOND" and "ARY_FEC" (i.e. just normal text).
In our other test cases this has been parsed as a single
text node. I verified the arbitrary splitting by adding
spaces earlier in the file: the position of the split
moved accordingly, giving "TINBICS_SEC" and "ONDARY_FEC".

Replacing the xpath with "//a[b="value"]" solved the problem,
so this seems to be a problem with using "text()" in the xpath.

The xpath spec says there should never be two adjacent text
nodes.
http://www.w3.org/TR/xpath#section-Text-Nodes

Second, the xpath spec says that 'text()' should select all
text nodes.
http://www.w3.org/TR/xpath#path-abbrev

I'm not sure if dom4j is "at fault", but it sure would be
nice if it could at least be resilient to the problem.

:-)

Andrew.


----------------------------------------------------------------------

Comment By: Steve Carter (cart33)
Date: 2005-02-12 20:06

Message:
Logged In: YES 
user_id=597933

Thanks for the explanation. Greatly appreciated. I have not
made myself familiar with the specification so I appreciate
your insight. It  just seemed intuitive to me that
selectSingleNode() would return the full value of the node
whether references were present or not. Feel free to close
this issue and pursue it as an enhancement as there are many
approaches to satisfy the solution. I enjoy using your api
and thanks again for the help. 

----------------------------------------------------------------------

Comment By: Maarten Coene (maartenc)
Date: 2005-02-12 07:04

Message:
Logged In: YES 
user_id=178745

On the other hand, I see now in section 5.7 of the XPath spec
that a text node shouldn't have immediately following
siblings that are text nodes themselves, so this could be a
bug indeed.

I'll investigate this further...

regards,
Maarten

----------------------------------------------------------------------

Comment By: Maarten Coene (maartenc)
Date: 2005-02-12 06:57

Message:
Logged In: YES 
user_id=178745

I don't think this is a bug.

The following happened:
expression "a/b/text()" selects all text nodes of <b>.
Because you have an entity reference in it, the SAX parser
you used created 3 text nodes: "Water T ", "&" and "
D-46816". The selectSingleNode() method returns the first
node: "Water T ". So this is correct.

expression "a/b" selects all <b> elements. If you apply the
string function to it, you will retrieve the string-value of
the <b> element. This expression should do the trick:
"string(a/b[1])", as illustrated by the example below:

String xml = "<a><b>Water T &amp; D-46816</b></a>";
Document doc = DocumentHelper.parseText(xml);
String result = (String) doc.selectObject("string(a/b[1])");

now, result is equal to "Water T & D-46816"

Another way is to retrieve the node and ask for the
string-value directly on the node:

Node node = doc.selectSingleNode("a/b");
String result = node.getStringValue();

I hope this helped you out. If you still feel this is a bug,
please tell me; otherwise I'll close this issue.

regards,
Maarten

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2005-02-10 09:49

Message:
Logged In: NO 

This problem affects other xpath query types such as /a/b/*
etc...

----------------------------------------------------------------------

_______________________________________________
dom4j-dev mailing list
dom4j-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dom4j-dev
