RE: Problem with well-formed XML docs

2006-07-31 Thread Andre Basse
Updated to the latest build. - Problem solved.

Thanks for your help!!!




-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Friday, 28 July 2006 5:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with well-formed XML docs


Andre, which app server are you using to run Solr? ... there have been several
reports of bugs with the way Jetty deals with the XML-escaped output
produced by Solr, particularly when non-ASCII characters are involved.

If you are using a version of Jetty, have you tried using a build more recent
than July 17th, when this patch was applied...

http://issues.apache.org/jira/browse/SOLR-32

?

: Date: Fri, 28 Jul 2006 17:09:51 +1000
: From: Andre Basse <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Problem with well-formed XML docs
:
: Hi all,
:
:
: I have imported some XML documents to Solr. However, when I do a query
: for certain documents I get the following error message in the browser:
:
: XML Parsing Error: not well-formed
: Location: http://192.168.32.128:8983/solr/select/?stylesheet=&q=cat%0D%0A&version=2.1&start=0&rows=10&indent=on

: Line Number 149, Column 185: An unusual Wemyss tabby cat, 32cm high,
: sporting a broad grin and inset green-glass eyes, caused a mild sensation
: on January 27 when it fetched £23,900 at Edinburgh auctioneers Lyon &
: Turnbull. This was more than four times the upper estimate. Among a
: sizeable line-up of Wemyss porkers, a seated pig just 16cm long and
: painted with shamrocks fetched £4780, almost 10 times its estimate.
:
^
:
:
: The error message is pointing to the & char in the result.
:
:
:
: This is the part of my source XML document that shows that the "&" is
: well-formed before import:
:
: "...when it fetched £23,900 at Edinburgh auctioneers Lyon & Turnbull.
: This was more than four .."
:
:
: Any idea?
:
:
: Any help is much appreciated!
:
:
:
:
: Thanks,
:
: Andre
:
:
:
:



-Hoss
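
For context: in well-formed XML, a literal "&" in element text must be
escaped as the entity &amp;, which is why the parser flags the ampersand
above. Below is a minimal sketch (not from this thread; the class and
method names are illustrative) of escaping field values on the client
before posting an update:

public class XmlEscape {

  // Escape the five XML special characters. "&" must be replaced first,
  // so the ampersands introduced by the later replacements are not
  // themselves re-escaped.
  static String escape(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;")
            .replace("'", "&apos;");
  }

  public static void main(String[] args) {
    // prints: Lyon &amp; Turnbull
    System.out.println(escape("Lyon & Turnbull"));
  }
}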



Re: Doc add limit

2006-07-31 Thread sangraal aiken

Just an update: I changed my doUpdate method to use the HTTPClient API and
have the exact same problem... the update hangs at exactly the same point...
6,144.

  // Note: HTTPConnection, HTTPResponse, and ModuleException here are from
  // the HTTPClient library (package HTTPClient), not java.net.
  private String doUpdate(String sw) {
    StringBuffer updateResult = new StringBuffer();
    try {
      // open connection
      log.info("Connecting to and preparing to post to SolrUpdate servlet.");
      URL url = new URL("http://localhost:8080/update");
      HTTPConnection con = new HTTPConnection(url);
      HTTPResponse resp = con.Post(url.getFile(), sw);

      if (resp.getStatusCode() >= 300) {
        System.err.println("Received Error: " + resp.getReasonLine());
        System.err.println(resp.getText());
      } else {
        updateResult.append(new String(resp.getData()));
      }

      log.info("Done updating Solr for site " + updateSite);
    } catch (IOException ioe) {
      System.err.println(ioe.toString());
    } catch (ModuleException me) {
      System.err.println("Error handling request: " + me.getMessage());
    } catch (Exception e) {
      System.err.println("Unknown Error: " + e.getMessage());
    }

    return updateResult.toString();
  }

-S

On 7/31/06, sangraal aiken <[EMAIL PROTECTED]> wrote:


Very interesting... thanks Thom. I haven't given HttpClient a shot yet,
but I will soon.

-S


On 7/31/06, Thom Nelson <[EMAIL PROTECTED]> wrote:
>
> I had a similar problem and was able to fix it in Solr by manually
> buffering the responses to a StringWriter before sending them to Tomcat.
> Essentially, Tomcat's buffer will only hold so much and at that point
> it blocks (thus it always hangs at a constant number of documents).
> However, a better solution (to be implemented) is to use more
> intelligent code on the client to read the response at the same time
> that it is sending input -- not too difficult to do, though best to do
> with two threads ( i.e. fire off a thread to read the response before
> you send any data).  Seeing as the HttpClient code probably does this
> already, I'll most likely end up using that.
>
> On 7/31/06, sangraal aiken <[EMAIL PROTECTED]> wrote:
> > Those are some great ideas Chris... I'm going to try some of them out.
> > I'll post the results when I get a chance to do more testing. Thanks.
> >
> > At this point I can work around the problem by ignoring Solr's response
> > but this is obviously not ideal. I would feel better knowing what is
> > causing the issue as well.
> >
> > -Sangraal
> >
> >
> >
> > On 7/29/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > : Sure, the method that does all the work updating Solr is the
> > > : doUpdate(String s) method in the GanjaUpdate class I'm pasting
> > > : below. It's hanging when I try to read the response... the last
> > > : output I receive in my log is Got Reader...
> > >
> > > I don't have the means to try out this code right now ... but I can't
> > > see any obvious problems with it (there may be somewhere that you are
> > > opening a stream or reader and not closing it, but I didn't see one)
> > > ... I notice you are running this client on the same machine as Solr
> > > (hence the localhost URLs). Did you by any chance try running the
> > > client on a separate machine to see if the number of updates before
> > > it hangs changes?
> > >
> > > My money is still on a filehandle resource limit somewhere ... if you
> > > are running on a system that has "lsof" (on some Unix/Linux
> > > installations you need sudo/su root permissions to run it) you can
> > > use "lsof -p <pid>" to look up what files/network connections are
> > > open for a given process.  You can try running that on both the
> > > client pid and the Solr server pid once it's hung -- you'll probably
> > > see a lot of Jar files in use for both, but if you see more than a
> > > few XML files open by the client, or more than one TCP connection
> > > open by either the client or the server, there's your culprit.
> > >
> > > I'm not sure what Windows equivalent of lsof may exist.
> > >
> > > Wait ... I just had another thought ...
> > >
> > > You are using InputStreamReader to deal with the InputStreams of your
> > > remote XML files -- but you aren't specifying a charset, so it's
> > > using your system default, which may be different from the charset of
> > > the original XML files you are pulling from the URL -- which (I
> > > *think*) means that your InputStreamReader may in some cases fail to
> > > read all of the bytes of the stream, which might leave some dangling
> > > filehandles (I'm just guessing on that part ... I'm not actually sure
> > > what happens in that case).
> > >
> > > What if you simplify your code (for the purposes of testing) and just
> > > put the post-transform version of ganja-full.xml in a big ass String
> > > variable in your Java app and just call
> > > GanjaUpdate.doUpdate(bigAssString) over and over again ... does that
> > > cause the same problem?
> > >
> > >
> > > :
> > > : --
> > > :
> > > : package co

Re: Doc add limit

2006-07-31 Thread sangraal aiken

Very interesting... thanks Thom. I haven't given HttpClient a shot yet,
but I will soon.

-S

On 7/31/06, Thom Nelson <[EMAIL PROTECTED]> wrote:


I had a similar problem and was able to fix it in Solr by manually
buffering the responses to a StringWriter before sending them to Tomcat.
Essentially, Tomcat's buffer will only hold so much and at that point
it blocks (thus it always hangs at a constant number of documents).
However, a better solution (to be implemented) is to use more
intelligent code on the client to read the response at the same time
that it is sending input -- not too difficult to do, though best to do
with two threads (i.e. fire off a thread to read the response before
you send any data).  Seeing as the HttpClient code probably does this
already, I'll most likely end up using that.

On 7/31/06, sangraal aiken <[EMAIL PROTECTED]> wrote:
> Those are some great ideas Chris... I'm going to try some of them out.
> I'll post the results when I get a chance to do more testing. Thanks.
>
> At this point I can work around the problem by ignoring Solr's response
> but this is obviously not ideal. I would feel better knowing what is
> causing the issue as well.
>
> -Sangraal
>
>
>
> On 7/29/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> >
> > : Sure, the method that does all the work updating Solr is the
> > : doUpdate(String s) method in the GanjaUpdate class I'm pasting
> > : below. It's hanging when I try to read the response... the last
> > : output I receive in my log is Got Reader...
> >
> > I don't have the means to try out this code right now ... but I can't
> > see any obvious problems with it (there may be somewhere that you are
> > opening a stream or reader and not closing it, but I didn't see one)
> > ... I notice you are running this client on the same machine as Solr
> > (hence the localhost URLs). Did you by any chance try running the
> > client on a separate machine to see if the number of updates before it
> > hangs changes?
> >
> > My money is still on a filehandle resource limit somewhere ... if you
> > are running on a system that has "lsof" (on some Unix/Linux
> > installations you need sudo/su root permissions to run it) you can use
> > "lsof -p <pid>" to look up what files/network connections are open for
> > a given process.  You can try running that on both the client pid and
> > the Solr server pid once it's hung -- you'll probably see a lot of Jar
> > files in use for both, but if you see more than a few XML files open
> > by the client, or more than one TCP connection open by either the
> > client or the server, there's your culprit.
> >
> > I'm not sure what Windows equivalent of lsof may exist.
> >
> > Wait ... I just had another thought ...
> >
> > You are using InputStreamReader to deal with the InputStreams of your
> > remote XML files -- but you aren't specifying a charset, so it's using
> > your system default, which may be different from the charset of the
> > original XML files you are pulling from the URL -- which (I *think*)
> > means that your InputStreamReader may in some cases fail to read all
> > of the bytes of the stream, which might leave some dangling
> > filehandles (I'm just guessing on that part ... I'm not actually sure
> > what happens in that case).
> >
> > What if you simplify your code (for the purposes of testing) and just
> > put the post-transform version of ganja-full.xml in a big ass String
> > variable in your Java app and just call
> > GanjaUpdate.doUpdate(bigAssString) over and over again ... does that
> > cause the same problem?
> >
> >
> > :
> > : --
> > :
> > : package com.iceninetech.solr.update;
> > :
> > : import com.iceninetech.xml.XMLTransformer;
> > :
> > : import java.io.*;
> > : import java.net.HttpURLConnection;
> > : import java.net.URL;
> > : import java.util.logging.Logger;
> > :
> > : public class GanjaUpdate {
> > :
> > :   private String updateSite = "";
> > :   private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";
> > :
> > :   private static final File xmlStorageDir = new
> > :     File("/source/solr/xml-dls/");
> > :
> > :   final Logger log = Logger.getLogger(GanjaUpdate.class.getName());
> > :
> > :   public GanjaUpdate(String siteName) {
> > :     this.updateSite = siteName;
> > :     log.info("GanjaUpdate is primed and ready to update " + siteName);
> > :   }
> > :
> > :   public void update() {
> > :     StringWriter sw = new StringWriter();
> > :
> > :     try {
> > :       // transform gawkerInput XML to SOLR update XML
> > :       XMLTransformer transform = new XMLTransformer();
> > :       log.info("About to transform ganjaInput XML to Solr Update XML");
> > :       transform.transform(getXML(), sw, getXSL());
> > :       log.info("Completed ganjaInput/SolrUpdate XML transform");
> > :
> > :       // Write transformed XML to Disk.
> > :       File transformedXML = new File(xmlStorageDir, updateSite+".sml");
> > :       FileWriter fw = new FileWriter(transformedXML);
> > :       fw.write(sw.toStri

Re: Doc add limit

2006-07-31 Thread sangraal aiken

Chris, my response is below each of your paragraphs...


I don't have the means to try out this code right now ... but I can't see
any obvious problems with it (there may be somewhere that you are opening
a stream or reader and not closing it, but I didn't see one) ... I notice
you are running this client on the same machine as Solr (hence the
localhost URLs). Did you by any chance try running the client on a
separate machine to see if the number of updates before it hangs changes?



When I run the client locally and the Solr server on a slower and separate
development box, the maximum number of updates drops to 3,219. So it's
almost as if it's related to some sort of timeout problem because the
maximum number of updates drops considerably on a slower machine, but it's
weird how consistent the number is. 6,144 locally, 5,000 something when I
run it on the external server, and 3,219 when the client is separate from
the server.

My money is still on a filehandle resource limit somewhere ... if you are
running on a system that has "lsof" (on some Unix/Linux installations you
need sudo/su root permissions to run it) you can use "lsof -p <pid>" to
look up what files/network connections are open for a given process.  You
can try running that on both the client pid and the Solr server pid once
it's hung -- you'll probably see a lot of Jar files in use for both, but
if you see more than a few XML files open by the client, or more than one
TCP connection open by either the client or the server, there's your
culprit.



The only output I get from 'lsof -p' that pertains to TCP connections is
the following... I'm not too sure how to interpret it though:

java  4104 sangraal  261u  IPv6 0x5b060f0  0t0  TCP *:8009 (LISTEN)
java  4104 sangraal  262u  IPv6 0x55d59e8  0t0  TCP [::127.0.0.1]:8005 (LISTEN)
java  4104 sangraal  263u  IPv6 0x53cc0e0  0t0  TCP [::127.0.0.1]:http-alt->[::127.0.0.1]:51039 (ESTABLISHED)
java  4104 sangraal  264u  IPv6 0x5b059d0  0t0  TCP [::127.0.0.1]:51045->[::127.0.0.1]:http-alt (ESTABLISHED)
java  4104 sangraal  265u  IPv6 0x53cc9c8  0t0  TCP [::127.0.0.1]:http-alt->[::127.0.0.1]:51045 (ESTABLISHED)
java  4104 sangraal   11u  IPv6 0x5b04f20  0t0  TCP *:http-alt (LISTEN)
java  4104 sangraal   12u  IPv6 0x5b06d68  0t0  TCP localhost:51037->localhost:51036 (TIME_WAIT)
I'm not sure what Windows equivalent of lsof may exist.


Wait ... I just had another thought ...

You are using InputStreamReader to deal with the InputStreams of your
remote XML files -- but you aren't specifying a charset, so it's using
your system default, which may be different from the charset of the
original XML files you are pulling from the URL -- which (I *think*)
means that your InputStreamReader may in some cases fail to read all of
the bytes of the stream, which might leave some dangling filehandles (I'm
just guessing on that part ... I'm not actually sure what happens in that
case).
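
For reference, a minimal sketch of this charset fix: name the encoding
explicitly when wrapping the InputStream instead of relying on the
platform default. The URL and class name here are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CharsetRead {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:8080/ganja-full.xml");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Decode as UTF-8 regardless of the client machine's default
    // encoding, so the stream is read consistently everywhere.
    BufferedReader rd = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = rd.readLine()) != null) {
      System.out.println(line);
    }
    rd.close();
    conn.disconnect();
  }
}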

What if you simplify your code (for the purposes of testing) and just put
the post-transform version of ganja-full.xml in a big ass String variable
in your Java app and just call GanjaUpdate.doUpdate(bigAssString) over and
over again ... does that cause the same problem?



In the code, I read the XML with a StringReader and then pass it to
GanjaUpdate as a string anyway.  I've output the String object and verified
that it is in fact all there.


-Sangraal


Re: Doc add limit

2006-07-31 Thread Thom Nelson

I had a similar problem and was able to fix it in Solr by manually
buffering the responses to a StringWriter before sending them to Tomcat.
Essentially, Tomcat's buffer will only hold so much and at that point
it blocks (thus it always hangs at a constant number of documents).
However, a better solution (to be implemented) is to use more
intelligent code on the client to read the response at the same time
that it is sending input -- not too difficult to do, though best to do
with two threads (i.e. fire off a thread to read the response before
you send any data).  Seeing as the HttpClient code probably does this
already, I'll most likely end up using that.
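
Below is a minimal, self-contained sketch of the two-thread approach
described above (this is not Thom's actual code; the host, port, path,
and request body are illustrative). A raw socket is used so the read and
write sides are explicit:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class TwoThreadPost {
  public static void main(String[] args) throws Exception {
    final Socket socket = new Socket("localhost", 8080);
    byte[] body = "<add><doc>...</doc></add>".getBytes("UTF-8");

    // Fire off the reader thread BEFORE sending any data, so the
    // response is drained as it arrives and the server can never block
    // on a full response buffer.
    Thread reader = new Thread(new Runnable() {
      public void run() {
        try {
          BufferedReader in = new BufferedReader(
              new InputStreamReader(socket.getInputStream(), "UTF-8"));
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);
          }
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    });
    reader.start();

    // Write the request on the main thread while the reader runs.
    OutputStream out = socket.getOutputStream();
    Writer headers = new OutputStreamWriter(out, "UTF-8");
    headers.write("POST /update HTTP/1.0\r\n");
    headers.write("Content-Type: text/xml; charset=UTF-8\r\n");
    headers.write("Content-Length: " + body.length + "\r\n\r\n");
    headers.flush();
    out.write(body);
    out.flush();

    reader.join();   // HTTP/1.0: server closes when the response is done
    socket.close();
  }
}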

On 7/31/06, sangraal aiken <[EMAIL PROTECTED]> wrote:

Those are some great ideas Chris... I'm going to try some of them out.  I'll
post the results when I get a chance to do more testing. Thanks.

At this point I can work around the problem by ignoring Solr's response but
this is obviously not ideal. I would feel better knowing what is causing the
issue as well.

-Sangraal



On 7/29/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Sure, the method that does all the work updating Solr is the
> : doUpdate(String s) method in the GanjaUpdate class I'm pasting below.
> : It's hanging when I try to read the response... the last output I
> : receive in my log is Got Reader...
>
> I don't have the means to try out this code right now ... but I can't
> see any obvious problems with it (there may be somewhere that you are
> opening a stream or reader and not closing it, but I didn't see one)
> ... I notice you are running this client on the same machine as Solr
> (hence the localhost URLs). Did you by any chance try running the
> client on a separate machine to see if the number of updates before it
> hangs changes?
>
> My money is still on a filehandle resource limit somewhere ... if you
> are running on a system that has "lsof" (on some Unix/Linux
> installations you need sudo/su root permissions to run it) you can use
> "lsof -p <pid>" to look up what files/network connections are open for
> a given process.  You can try running that on both the client pid and
> the Solr server pid once it's hung -- you'll probably see a lot of Jar
> files in use for both, but if you see more than a few XML files open by
> the client, or more than one TCP connection open by either the client
> or the server, there's your culprit.
>
> I'm not sure what Windows equivalent of lsof may exist.
>
> Wait ... I just had another thought ...
>
> You are using InputStreamReader to deal with the InputStreams of your
> remote XML files -- but you aren't specifying a charset, so it's using
> your system default, which may be different from the charset of the
> original XML files you are pulling from the URL -- which (I *think*)
> means that your InputStreamReader may in some cases fail to read all of
> the bytes of the stream, which might leave some dangling filehandles
> (I'm just guessing on that part ... I'm not actually sure what happens
> in that case).
>
> What if you simplify your code (for the purposes of testing) and just
> put the post-transform version of ganja-full.xml in a big ass String
> variable in your Java app and just call
> GanjaUpdate.doUpdate(bigAssString) over and over again ... does that
> cause the same problem?
>
>
> :
> : --
> :
> : package com.iceninetech.solr.update;
> :
> : import com.iceninetech.xml.XMLTransformer;
> :
> : import java.io.*;
> : import java.net.HttpURLConnection;
> : import java.net.URL;
> : import java.util.logging.Logger;
> :
> : public class GanjaUpdate {
> :
> :   private String updateSite = "";
> :   private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";
> :
> :   private static final File xmlStorageDir = new
> :     File("/source/solr/xml-dls/");
> :
> :   final Logger log = Logger.getLogger(GanjaUpdate.class.getName());
> :
> :   public GanjaUpdate(String siteName) {
> :     this.updateSite = siteName;
> :     log.info("GanjaUpdate is primed and ready to update " + siteName);
> :   }
> :
> :   public void update() {
> :     StringWriter sw = new StringWriter();
> :
> :     try {
> :       // transform gawkerInput XML to SOLR update XML
> :       XMLTransformer transform = new XMLTransformer();
> :       log.info("About to transform ganjaInput XML to Solr Update XML");
> :       transform.transform(getXML(), sw, getXSL());
> :       log.info("Completed ganjaInput/SolrUpdate XML transform");
> :
> :       // Write transformed XML to Disk.
> :       File transformedXML = new File(xmlStorageDir, updateSite+".sml");
> :       FileWriter fw = new FileWriter(transformedXML);
> :       fw.write(sw.toString());
> :       fw.close();
> :
> :       // post to Solr
> :       log.info("About to update Solr for site " + updateSite);
> :       String result = this.doUpdate(sw.toString());
> :       log.info("Solr says: " + result);
> :       sw.close();
> :     } catch (Exception e) {
> :       e.printStackTrace();
> :     }
> :   }
> :

Re: Doc add limit

2006-07-31 Thread sangraal aiken

Those are some great ideas Chris... I'm going to try some of them out.  I'll
post the results when I get a chance to do more testing. Thanks.

At this point I can work around the problem by ignoring Solr's response but
this is obviously not ideal. I would feel better knowing what is causing the
issue as well.

-Sangraal



On 7/29/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Sure, the method that does all the work updating Solr is the
: doUpdate(String s) method in the GanjaUpdate class I'm pasting below.
: It's hanging when I try to read the response... the last output I
: receive in my log is Got Reader...

I don't have the means to try out this code right now ... but I can't see
any obvious problems with it (there may be somewhere that you are opening
a stream or reader and not closing it, but I didn't see one) ... I notice
you are running this client on the same machine as Solr (hence the
localhost URLs). Did you by any chance try running the client on a
separate machine to see if the number of updates before it hangs changes?

My money is still on a filehandle resource limit somewhere ... if you are
running on a system that has "lsof" (on some Unix/Linux installations you
need sudo/su root permissions to run it) you can use "lsof -p <pid>" to
look up what files/network connections are open for a given process.  You
can try running that on both the client pid and the Solr server pid once
it's hung -- you'll probably see a lot of Jar files in use for both, but
if you see more than a few XML files open by the client, or more than one
TCP connection open by either the client or the server, there's your
culprit.

I'm not sure what Windows equivalent of lsof may exist.

Wait ... I just had another thought ...

You are using InputStreamReader to deal with the InputStreams of your
remote XML files -- but you aren't specifying a charset, so it's using
your system default, which may be different from the charset of the
original XML files you are pulling from the URL -- which (I *think*)
means that your InputStreamReader may in some cases fail to read all of
the bytes of the stream, which might leave some dangling filehandles (I'm
just guessing on that part ... I'm not actually sure what happens in that
case).

What if you simplify your code (for the purposes of testing) and just put
the post-transform version of ganja-full.xml in a big ass String variable
in your Java app and just call GanjaUpdate.doUpdate(bigAssString) over and
over again ... does that cause the same problem?


:
: --
:
: package com.iceninetech.solr.update;
:
: import com.iceninetech.xml.XMLTransformer;
:
: import java.io.*;
: import java.net.HttpURLConnection;
: import java.net.URL;
: import java.util.logging.Logger;
:
: public class GanjaUpdate {
:
:   private String updateSite = "";
:   private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";
:
:   private static final File xmlStorageDir = new
:     File("/source/solr/xml-dls/");
:
:   final Logger log = Logger.getLogger(GanjaUpdate.class.getName());
:
:   public GanjaUpdate(String siteName) {
:     this.updateSite = siteName;
:     log.info("GanjaUpdate is primed and ready to update " + siteName);
:   }
:
:   public void update() {
:     StringWriter sw = new StringWriter();
:
:     try {
:       // transform gawkerInput XML to SOLR update XML
:       XMLTransformer transform = new XMLTransformer();
:       log.info("About to transform ganjaInput XML to Solr Update XML");
:       transform.transform(getXML(), sw, getXSL());
:       log.info("Completed ganjaInput/SolrUpdate XML transform");
:
:       // Write transformed XML to Disk.
:       File transformedXML = new File(xmlStorageDir, updateSite+".sml");
:       FileWriter fw = new FileWriter(transformedXML);
:       fw.write(sw.toString());
:       fw.close();
:
:       // post to Solr
:       log.info("About to update Solr for site " + updateSite);
:       String result = this.doUpdate(sw.toString());
:       log.info("Solr says: " + result);
:       sw.close();
:     } catch (Exception e) {
:       e.printStackTrace();
:     }
:   }
:
:   public File getXML() {
:     String XML_URL = "http://localhost:8080/" + updateSite +
:       "/ganja-full.xml";
:
:     // check for file
:     File localXML = new File(xmlStorageDir, updateSite + ".xml");
:
:     try {
:       if (localXML.createNewFile() && localXML.canWrite()) {
:         // open connection
:         log.info("Downloading: " + XML_URL);
:         URL url = new URL(XML_URL);
:         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
:         conn.setRequestMethod("GET");
:
:         // Read response to File
:         log.info("Storing XML to File" + localXML.getCanonicalPath());
:         FileOutputStream fos = new FileOutputStream(new File(xmlStorageDir,
:           updateSite + ".xml"));
:
:         BufferedReader rd = new BufferedReader(new InputStreamReader(
:           conn.getInputStream()));
:         String line;
:         while ((line = rd.readLine()) != null) {
: