Top Searches

2006-12-11 Thread sangraal aiken

I'm looking into creating something to track the top 10 - 20 searches that
run through Solr for a given period.

I could just create a counter object with an internal TreeMap or something
that just keeps count of the various terms, but it could grow very large
very fast and I'm not yet sure what implications this would have on memory
usage. Also, storing it in memory means it would be wiped out during a
restart, so it's not ideal.

Other ideas I had were storing them in a database table, or in a separate
Solr instance. Each method has its own advantages and drawbacks.

Has anyone looked into or had any experience doing something like this? Any
info or advice would be appreciated.

-Sangraal A.


Re: Top Searches

2006-12-11 Thread sangraal aiken

That's a great idea, thanks Yonik.

-Sangraal

On 12/11/06, Yonik Seeley [EMAIL PROTECTED] wrote:


On 12/11/06, sangraal aiken [EMAIL PROTECTED] wrote:
 I'm looking into creating something to track the top 10 - 20 searches
that
 run through Solr for a given period.

For offline processing, using log files is the simplest thing... the
code remains separated, you can do historical processing if you keep
the logs, and it doesn't affect live queries.
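
For example, a rough sketch of that kind of offline tally (this assumes each
request shows up in the log on one line with a q= parameter -- the exact log
format depends on your servlet container, so adjust the pattern and the class
name to taste):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopQueries {
  public static void main(String[] args) throws Exception {
    // Count every q= value seen in the log file given on the command line.
    Pattern q = Pattern.compile("q=([^&\\s]+)");
    Map<String, Integer> counts = new HashMap<String, Integer>();

    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = q.matcher(line);
      if (m.find()) {
        Integer c = counts.get(m.group(1));
        counts.put(m.group(1), c == null ? 1 : c + 1);
      }
    }
    in.close();

    // Sort by count descending and print the top 20.
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue() - a.getValue();
      }
    });
    for (int i = 0; i < Math.min(20, entries.size()); i++) {
      System.out.println(entries.get(i).getValue() + "\t" + entries.get(i).getKey());
    }
  }
}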

It depends on how fresh the info needs to be and how it will be used.

-Yonik



Default XML Output Schema

2006-09-21 Thread sangraal aiken

Perhaps a silly question, but I'm wondering if anyone can tell me why Solr
outputs XML like this:

<doc>
<int name="id">201038</int>
<int name="siteId">31</int>
<date name="modified">2006-09-15T21:36:39.000Z</date>
</doc>

rather than like this:

<doc>
<id type="int">201038</id>
<siteId type="int">31</siteId>
<modified type="date">2006-09-15T21:36:39.000Z</modified>
</doc>

A front-end PHP developer I know is having trouble parsing the default Solr
output because of that format and mentioned it would be much easier in the
latter format... so I was curious if there was a reason it is the way it is.

-Sangraal


Re: Default XML Output Schema

2006-09-21 Thread sangraal aiken

Thanks for the great explanation Yonik, I passed it on to my colleagues for
reference... I knew there was a good reason.

-Sangraal

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:


On 9/21/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Perhaps a silly question, but I'm wondering if anyone can tell me why Solr
 outputs XML like this:

During the initial development of Solr (2004), I remember throwing up
both options, and most developers preferred to have a limited number
of well defined tags.

It allows you to have rather arbitrary field names, which you couldn't
have if you used the field name as the tag.

It also allows consistency with custom data.  For example, here is the
representation of an array of integers:
<arr><int>1</int><int>2</int></arr>
If field names were used as tags, we would have to either make up a
dummy-name, or we wouldn't be able to use the same style.


 <doc>
 <int name="id">201038</int>
 <int name="siteId">31</int>
 <date name="modified">2006-09-15T21:36:39.000Z</date>
 </doc>

 rather than like this:

 <doc>
 <id type="int">201038</id>
 <siteId type="int">31</siteId>
 <modified type="date">2006-09-15T21:36:39.000Z</modified>
 </doc>

 A front-end PHP developer I know is having trouble parsing the default Solr
 output because of that format and mentioned it would be much easier in the
 latter format... so I was curious if there was a reason it is the way it is.

There are a number of options for you.
You could write your own QueryResponseWriter to output XML just as you
like it, or use an XSLT stylesheet in conjunction with
http://issues.apache.org/jira/browse/SOLR-49
or use another format such as JSON.
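
For what it's worth, the name-attribute format is also easy to walk with a
generic XML parser; here's a rough, untested Java sketch (the class name is
made up) that collects each doc's fields into a map keyed by the name
attribute:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ResponseDocParser {
  // Turn one <doc> element of the standard response into a field-name -> value map.
  static Map<String, String> parseDoc(Element doc) {
    Map<String, String> fields = new HashMap<String, String>();
    NodeList children = doc.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i) instanceof Element) {
        Element field = (Element) children.item(i);
        fields.put(field.getAttribute("name"), field.getTextContent());
      }
    }
    return fields;
  }

  public static void main(String[] args) throws Exception {
    // args[0] is a saved Solr XML response.
    Document d = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File(args[0]));
    NodeList docs = d.getElementsByTagName("doc");
    for (int i = 0; i < docs.getLength(); i++) {
      System.out.println(parseDoc((Element) docs.item(i)));
    }
  }
}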

-Yonik



Re: Doc add limit, im experiencing it too

2006-09-06 Thread sangraal aiken

I sent out an email about this a while back, but basically this limit
appears only on Tomcat and only when Solr attempts to write to the response.


You can work around it by splitting up your posts so that you're posting
less than 5,000 (or whatever your limit seems to be) at a time. You DO NOT
have to commit after each post. I recently indexed a 38 million document
database with this problem, and although it took about 8-9 hours it did
work... I only committed every 100,000 or so.
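
Roughly, the batching looks like this (a sketch only -- the URL, batch size,
and content type are whatever fits your setup, and the docs here are assumed
to be pre-built <doc>...</doc> strings):

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class BatchedUpdater {
  private static final int BATCH_SIZE = 5000; // stay under wherever the hang shows up

  // Post one <add>...</add> or <commit/> body to the update servlet and read the response.
  static String post(String body) throws IOException {
    URL url = new URL("http://localhost:8080/update");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
    conn.setDoOutput(true);
    OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
    out.write(body);
    out.close();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    StringBuffer rsp = new StringBuffer();
    String line;
    while ((line = in.readLine()) != null) rsp.append(line);
    in.close();
    return rsp.toString();
  }

  // Send the docs in small batches; commit once at the end, not per batch.
  public static void index(List<String> docs) throws IOException {
    StringBuffer batch = new StringBuffer();
    int inBatch = 0;
    for (String doc : docs) {
      batch.append(doc);
      if (++inBatch == BATCH_SIZE) {
        post("<add>" + batch + "</add>");
        batch.setLength(0);
        inBatch = 0;
      }
    }
    if (inBatch > 0) post("<add>" + batch + "</add>");
    post("<commit/>");
  }
}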

-Sangraal

On 9/6/06, Michael Imbeault [EMAIL PROTECTED] wrote:


Old issue (see
http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html),
but I'm experiencing the same exact thing on Windows XP, latest Tomcat.
I noticed that the Tomcat process gobbles memory (10 megs a second
maybe) and then jams at 125 megs. Can't find a fix yet. I'm using a PHP
interface and curl to post my xml, one document at a time, and commit
every 100 documents. Indexing 30,000 docs, it hangs at maybe 5,000. Anyone
got an idea on this one? It would be helpful. I may try to switch to
Jetty tomorrow if nothing works :(

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Doc add limit

2006-07-31 Thread sangraal aiken

Those are some great ideas Chris... I'm going to try some of them out.  I'll
post the results when I get a chance to do more testing. Thanks.

At this point I can work around the problem by ignoring Solr's response but
this is obviously not ideal. I would feel better knowing what is causing the
issue as well.

-Sangraal



On 7/29/06, Chris Hostetter [EMAIL PROTECTED] wrote:



: Sure, the method that does all the work updating Solr is the doUpdate(String s)
: method in the GanjaUpdate class I'm pasting below. It's hanging when I
: try to read the response... the last output I receive in my log is "Got
: Reader..."

I don't have the means to try out this code right now ... but i can't see
any obvious problems with it (there may be somewhere that you are opening
a stream or reader and not closing it, but i didn't see one) ... i notice
you are running this client on the same machine as Solr (hence the
localhost URLs). Did you by any chance try running the client on a separate
machine to see if the number of updates before it hangs changes?

my money is still on a filehandle resource limit somewhere ... if you are
running on a system that has lsof (on some Unix/Linux installations you
need sudo/su root permissions to run it) you can use "lsof -p <pid>" to
look up what files/network connections are open for a given process.  You
can try running that on both the client pid and the Solr server pid once
it's hung -- you'll probably see a lot of Jar files in use for both, but
if you see more than a few XML files open by the client, or more than
one TCP connection open by either the client or the server, there's your
culprit.

I'm not sure what Windows equivalent of lsof may exist.

Wait ... i just had another thought

You are using InputStreamReader to deal with the InputStreams of your
remote XML files -- but you aren't specifying a charset, so it's using
your system default, which may be different from the charset of the
original XML files you are pulling from the URL -- which (i *think*)
means that your InputStreamReader may in some cases fail to read all of
the bytes of the stream, which might leave some dangling filehandles (i'm
just guessing on that part ... i'm not actually sure what happens in that
case).
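
i.e. something like this (assuming the remote XML really is UTF-8):

  BufferedReader rd = new BufferedReader(
      new InputStreamReader(conn.getInputStream(), "UTF-8"));

(and if you do that, it probably makes sense to write the bytes back out
using the same charset as well)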

What if you simplify your code (for the purposes of testing) and just put
the post-transform version of ganja-full.xml in a big ass String variable in
your java app and just call GanjaUpdate.doUpdate(bigAssString) over and
over again ... does that cause the same problem?
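
A rough sketch of that test (it assumes you make doUpdate accessible for the
test, and it reuses the transformed file GanjaUpdate already writes under
/source/solr/xml-dls/ -- the site name and loop count are made up):

import com.iceninetech.solr.update.GanjaUpdate;
import java.io.*;

public class RepeatUpdateTest {
  public static void main(String[] args) throws Exception {
    // Load the already-transformed update XML once...
    BufferedReader r = new BufferedReader(new InputStreamReader(
        new FileInputStream("/source/solr/xml-dls/ganja.sml"), "UTF-8"));
    StringBuffer bigAssString = new StringBuffer();
    String line;
    while ((line = r.readLine()) != null) bigAssString.append(line).append('\n');
    r.close();

    // ...then post the identical payload over and over, skipping the
    // download/transform steps entirely.
    GanjaUpdate gu = new GanjaUpdate("ganja");
    for (int i = 0; i < 50; i++) {
      System.out.println(i + ": " + gu.doUpdate(bigAssString.toString()));
    }
  }
}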


:
: --
:
: package com.iceninetech.solr.update;
:
: import com.iceninetech.xml.XMLTransformer;
:
: import java.io.*;
: import java.net.HttpURLConnection;
: import java.net.URL;
: import java.util.logging.Logger;
:
: public class GanjaUpdate {
:
:   private String updateSite = "";
:   private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";
:
:   private static final File xmlStorageDir = new File("/source/solr/xml-dls/");
:
:   final Logger log = Logger.getLogger(GanjaUpdate.class.getName());
:
:   public GanjaUpdate(String siteName) {
:     this.updateSite = siteName;
:     log.info("GanjaUpdate is primed and ready to update " + siteName);
:   }
:
:   public void update() {
:     StringWriter sw = new StringWriter();
:
:     try {
:       // transform gawkerInput XML to SOLR update XML
:       XMLTransformer transform = new XMLTransformer();
:       log.info("About to transform ganjaInput XML to Solr Update XML");
:       transform.transform(getXML(), sw, getXSL());
:       log.info("Completed ganjaInput/SolrUpdate XML transform");
:
:       // Write transformed XML to Disk.
:       File transformedXML = new File(xmlStorageDir, updateSite + ".sml");
:       FileWriter fw = new FileWriter(transformedXML);
:       fw.write(sw.toString());
:       fw.close();
:
:       // post to Solr
:       log.info("About to update Solr for site " + updateSite);
:       String result = this.doUpdate(sw.toString());
:       log.info("Solr says: " + result);
:       sw.close();
:     } catch (Exception e) {
:       e.printStackTrace();
:     }
:   }
:
:   public File getXML() {
:     String XML_URL = "http://localhost:8080/" + updateSite + "/ganja-full.xml";
:
:     // check for file
:     File localXML = new File(xmlStorageDir, updateSite + ".xml");
:
:     try {
:       if (localXML.createNewFile() && localXML.canWrite()) {
:         // open connection
:         log.info("Downloading: " + XML_URL);
:         URL url = new URL(XML_URL);
:         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
:         conn.setRequestMethod("GET");
:
:         // Read response to File
:         log.info("Storing XML to File " + localXML.getCanonicalPath());
:         FileOutputStream fos = new FileOutputStream(new File(xmlStorageDir, updateSite + ".xml"));
:
:         BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
:         String line;
:         while ((line = rd.readLine()) != null) {
:           line = line + '\n'; // add

Re: Doc add limit

2006-07-28 Thread sangraal aiken

Sure, the method that does all the work updating Solr is the doUpdate(String s)
method in the GanjaUpdate class I'm pasting below. It's hanging when I
try to read the response... the last output I receive in my log is "Got
Reader..."

--

package com.iceninetech.solr.update;

import com.iceninetech.xml.XMLTransformer;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.logging.Logger;

public class GanjaUpdate {

  private String updateSite = "";
  private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";

  private static final File xmlStorageDir = new File("/source/solr/xml-dls/");

  final Logger log = Logger.getLogger(GanjaUpdate.class.getName());

  public GanjaUpdate(String siteName) {
    this.updateSite = siteName;
    log.info("GanjaUpdate is primed and ready to update " + siteName);
  }

  public void update() {
    StringWriter sw = new StringWriter();

    try {
      // transform gawkerInput XML to SOLR update XML
      XMLTransformer transform = new XMLTransformer();
      log.info("About to transform ganjaInput XML to Solr Update XML");
      transform.transform(getXML(), sw, getXSL());
      log.info("Completed ganjaInput/SolrUpdate XML transform");

      // Write transformed XML to Disk.
      File transformedXML = new File(xmlStorageDir, updateSite + ".sml");
      FileWriter fw = new FileWriter(transformedXML);
      fw.write(sw.toString());
      fw.close();

      // post to Solr
      log.info("About to update Solr for site " + updateSite);
      String result = this.doUpdate(sw.toString());
      log.info("Solr says: " + result);
      sw.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public File getXML() {
    String XML_URL = "http://localhost:8080/" + updateSite + "/ganja-full.xml";

    // check for file
    File localXML = new File(xmlStorageDir, updateSite + ".xml");

    try {
      if (localXML.createNewFile() && localXML.canWrite()) {
        // open connection
        log.info("Downloading: " + XML_URL);
        URL url = new URL(XML_URL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read response to File
        log.info("Storing XML to File " + localXML.getCanonicalPath());
        FileOutputStream fos = new FileOutputStream(new File(xmlStorageDir, updateSite + ".xml"));

        BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = rd.readLine()) != null) {
          line = line + '\n'; // add break after each line. It preserves formatting.
          fos.write(line.getBytes("UTF8"));
        }

        // close connections
        rd.close();
        fos.close();
        conn.disconnect();
        log.info("Got the XML... File saved.");
      }
    } catch (Exception e) {
      e.printStackTrace();
    }

    return localXML;
  }

  public File getXSL() {
    StringBuffer retVal = new StringBuffer();

    // check for file
    File localXSL = new File(xmlStorageDir, "ganja.xsl");

    try {
      if (localXSL.createNewFile() && localXSL.canWrite()) {
        // open connection
        log.info("Downloading: " + XSL_URL);
        URL url = new URL(XSL_URL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Read response
        BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = rd.readLine()) != null) {
          line = line + '\n';
          retVal.append(line);
        }
        // close connections
        rd.close();
        conn.disconnect();

        log.info("Got the XSLT.");

        // output file
        log.info("Storing XSL to File " + localXSL.getCanonicalPath());
        FileOutputStream fos = new FileOutputStream(new File(xmlStorageDir, "ganja.xsl"));
        fos.write(retVal.toString().getBytes());
        fos.close();
        log.info("File saved.");
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return localXSL;
  }

  private String doUpdate(String sw) {
    StringBuffer updateResult = new StringBuffer();
    try {
      // open connection
      log.info("Connecting to and preparing to post to SolrUpdate servlet.");
      URL url = new URL("http://localhost:8080/update");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Content-Type", "application/octet-stream");
      conn.setDoOutput(true);
      conn.setDoInput(true);
      conn.setUseCaches(false);

      // Write to server
      log.info("About to post to SolrUpdate servlet.");
      DataOutputStream output = new DataOutputStream(conn.getOutputStream());
      output.writeBytes(sw);
      output.flush();
      output.close();
      log.info("Finished posting to SolrUpdate servlet.");

      // Read response
      log.info("Ready to read response.");
      BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
      log.info("Got reader");
 

Re: Doc add limit

2006-07-28 Thread sangraal aiken

Yeah that code is pretty bare bones... I'm still in the initial testing
stage. You're right, it definitely needs some more thorough work.

I did try removing all the conn.disconnect(); statements and there was no
change.

I'm going to give the Java Client code you sent me yesterday a shot and see
what happens with that. I'm kind of out of ideas for what could be causing
the hang... it really seems to just get locked in some sort of loop, but
there are absolutely no exceptions being thrown either on the Solr side or
the Client side... it just stops processing.

-Sangraal

On 7/28/06, Yonik Seeley [EMAIL PROTECTED] wrote:


It may be some sort of weird interaction with persistent connections
and timeouts (both client and server have connection timeouts I
assume).

Does anything change if you remove your .disconnect() call? (It
shouldn't be needed.)
Do you ever see any exceptions on the client side?

The code you show probably needs more error handling (finally blocks
with closes), but  if you don't see any stack traces from your
e.printStackTrace() then it doesn't have anything to do with this
problem.
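
For example, a rough sketch of the shape I mean (read the whole response and
close everything in finally blocks; the URL matches your code, and I've used
the text/xml content type here):

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {
  static String postUpdate(String xml) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://localhost:8080/update").openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
    conn.setDoOutput(true);

    OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
    try {
      out.write(xml);
    } finally {
      out.close();
    }

    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    try {
      StringBuffer rsp = new StringBuffer();
      String line;
      while ((line = in.readLine()) != null) rsp.append(line);
      return rsp.toString();
    } finally {
      in.close(); // always release the stream, even if reading throws
    }
  }
}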

Getting all the little details of connection handling correct can be
tough... it's probably a good idea if we work toward common client
libraries so everyone doesn't have to reinvent them.

-Yonik

On 7/28/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Sure, the method that does all the work updating Solr is the
doUpdate(String
 s) method in the GanjaUpdate class I'm pasting below. It's hanging when
I
 try to read the response... the last output I receive in my log is Got
 Reader...

 --

 package com.iceninetech.solr.update;

 import com.iceninetech.xml.XMLTransformer;

 import java.io.*;
 import java.net.HttpURLConnection;
 import java.net.URL;
 import java.util.logging.Logger;

 public class GanjaUpdate {

   private String updateSite = "";
   private String XSL_URL = "http://localhost:8080/xsl/ganja.xsl";

   private static final File xmlStorageDir = new File("/source/solr/xml-dls/");

   final Logger log = Logger.getLogger(GanjaUpdate.class.getName());

   public GanjaUpdate(String siteName) {
     this.updateSite = siteName;
     log.info("GanjaUpdate is primed and ready to update " + siteName);
   }

   public void update() {
     StringWriter sw = new StringWriter();

     try {
       // transform gawkerInput XML to SOLR update XML
       XMLTransformer transform = new XMLTransformer();
       log.info("About to transform ganjaInput XML to Solr Update XML");
       transform.transform(getXML(), sw, getXSL());
       log.info("Completed ganjaInput/SolrUpdate XML transform");

       // Write transformed XML to Disk.
       File transformedXML = new File(xmlStorageDir, updateSite + ".sml");
       FileWriter fw = new FileWriter(transformedXML);
       fw.write(sw.toString());
       fw.close();

       // post to Solr
       log.info("About to update Solr for site " + updateSite);
       String result = this.doUpdate(sw.toString());
       log.info("Solr says: " + result);
       sw.close();
     } catch (Exception e) {
       e.printStackTrace();
     }
   }

   public File getXML() {
     String XML_URL = "http://localhost:8080/" + updateSite + "/ganja-full.xml";

     // check for file
     File localXML = new File(xmlStorageDir, updateSite + ".xml");

     try {
       if (localXML.createNewFile() && localXML.canWrite()) {
         // open connection
         log.info("Downloading: " + XML_URL);
         URL url = new URL(XML_URL);
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("GET");

         // Read response to File
         log.info("Storing XML to File " + localXML.getCanonicalPath());
         FileOutputStream fos = new FileOutputStream(new File(xmlStorageDir, updateSite + ".xml"));

         BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         String line;
         while ((line = rd.readLine()) != null) {
           line = line + '\n'; // add break after each line. It preserves formatting.
           fos.write(line.getBytes("UTF8"));
         }

         // close connections
         rd.close();
         fos.close();
         conn.disconnect();
         log.info("Got the XML... File saved.");
       }
     } catch (Exception e) {
       e.printStackTrace();
     }

     return localXML;
   }

   public File getXSL() {
     StringBuffer retVal = new StringBuffer();

     // check for file
     File localXSL = new File(xmlStorageDir, "ganja.xsl");

     try {
       if (localXSL.createNewFile() && localXSL.canWrite()) {
         // open connection
         log.info("Downloading: " + XSL_URL);
         URL url = new URL(XSL_URL);
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("GET");
         // Read response
         BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         String line;
         while ((line = rd.readLine

Re: Doc add limit

2006-07-27 Thread sangraal aiken

Yeah, I'm closing them.  Here's the method:

-
 private String doUpdate(String sw) {
   StringBuffer updateResult = new StringBuffer();
   try {
     // open connection
     log.info("Connecting to and preparing to post to SolrUpdate servlet.");
     URL url = new URL("http://localhost:8080/update");
     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
     conn.setRequestMethod("POST");
     conn.setRequestProperty("Content-Type", "application/octet-stream");
     conn.setDoOutput(true);
     conn.setDoInput(true);
     conn.setUseCaches(false);

     // Write to server
     log.info("About to post to SolrUpdate servlet.");
     DataOutputStream output = new DataOutputStream(conn.getOutputStream());
     output.writeBytes(sw);
     output.flush();
     output.close();
     log.info("Finished posting to SolrUpdate servlet.");

     // Read response
     log.info("Ready to read response.");
     BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
     log.info("Got reader");
     String line;
     while ((line = rd.readLine()) != null) {
       log.info("Writing to result...");
       updateResult.append(line);
     }
     rd.close();

     // close connections
     conn.disconnect();

     log.info("Done updating Solr for site " + updateSite);
   } catch (Exception e) {
     e.printStackTrace();
   }

   return updateResult.toString();
 }
}

-Sangraal

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:


Are you reading the response and closing the connection?  If not, you
are probably running out of socket connections.

-Yonik

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Yonik,
 It looks like the problem is with the way I'm posting to the SolrUpdate
 servlet. I am able to use curl to post the data to my tomcat instance
 without a problem. It only fails when I try to handle the http post from
 java... my code is below:

    URL url = new URL("http://localhost:8983/solr/update");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/octet-stream");
    conn.setDoOutput(true);
    conn.setDoInput(true);
    conn.setUseCaches(false);

    // Write to server
    log.info("About to post to SolrUpdate servlet.");
    DataOutputStream output = new DataOutputStream(conn.getOutputStream());
    output.writeBytes(sw);
    output.flush();
    log.info("Finished posting to SolrUpdate servlet.");

 -Sangraal

 On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
   I removed everything from the Add xml so the docs looked like this:
  
   <doc>
   <field name="id">187880</field>
   </doc>
   <doc>
   <field name="id">187852</field>
   </doc>
  
   and it still hung at 6,144...
 
  Maybe you can try the following simple Python client to try and rule
  out some kind of different client interactions... the attached script
  adds 10,000 documents and works fine for me in WinXP w/ Tomcat 5.5.17
  and Jetty
 
  -Yonik
 
 
  solr.py --

  import httplib
  import socket

  class SolrConnection:
    def __init__(self, host='localhost:8983', solrBase='/solr'):
      self.host = host
      self.solrBase = solrBase
      #a connection to the server is not opened at this point.
      self.conn = httplib.HTTPConnection(self.host)
      #self.conn.set_debuglevel(100)
      self.postheaders = {"Connection": "close"}

    def doUpdateXML(self, request):
      try:
        self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
      except (socket.error, httplib.CannotSendRequest):
        #reconnect in case the connection was broken from the server going down,
        #the server timing out our persistent connection, or another
        #network failure.
        #Also catch httplib.CannotSendRequest because the HTTPConnection object
        #can get in a bad state.
        self.conn.close()
        self.conn.connect()
        self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)

      rsp = self.conn.getresponse()
      #print rsp.status, rsp.reason
      data = rsp.read()
      #print "data=", data
      self.conn.close()

    def delete(self, id):
      xstr = '<delete><id>' + id + '</id></delete>'
      self.doUpdateXML(xstr)

    def add(self, **fields):
      #todo: XML escaping
      flist = ['<field name="%s">%s</field>' % f for f in fields.items()]
      flist.insert(0, '<add><doc>')
      flist.append('</doc></add>')
      xstr = ''.join(flist)
      self.doUpdateXML(xstr)

  c = SolrConnection()
  #for i in range(10000):
  #  c.delete(str(i))
  for i in range(10000):
    c.add(id=i)



Re: Doc add limit

2006-07-27 Thread sangraal aiken
On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Mike,
  I've been posting with the content type set like this:
    conn.setRequestProperty("Content-Type", "application/octet-stream");

  I tried your suggestion though, and unfortunately there was no change.
    conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

 -Sangraal


 On 7/27/06, Mike Klaas [EMAIL PROTECTED] wrote:
 
  On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
    class SolrConnection:
      def __init__(self, host='localhost:8983', solrBase='/solr'):
        self.host = host
        self.solrBase = solrBase
        #a connection to the server is not opened at this point.
        self.conn = httplib.HTTPConnection(self.host)
        #self.conn.set_debuglevel(100)
        self.postheaders = {"Connection": "close"}

      def doUpdateXML(self, request):
        try:
          self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
 
   Digressive note: I'm not sure if it is necessary with Tomcat, but in
   my experience driving Solr with Python using Jetty, it was necessary
   to specify the content-type when posting utf-8 data:
 
  self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})
 
  -Mike
 








Re: Doc add limit

2006-07-27 Thread sangraal aiken

I'll give that a shot...

Thanks again for all your help.

-S

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:


You might also try the Java update client here:
http://issues.apache.org/jira/browse/SOLR-20

-Yonik



Re: Doc add limit

2006-07-26 Thread sangraal aiken
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:613)


-Yonik



On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Hey there... I'm having an issue with large doc updates on my solr
 installation. I'm adding in batches between 2-20,000 docs at a time and I've
 noticed Solr seems to hang at 6,144 docs every time. Breaking the adds into
 smaller batches works just fine, but I was wondering if anyone knew why this
 would happen. I've tried doubling memory as well as tweaking various config
 options but nothing seems to let me break the 6,144 barrier.

 This is the output from Solr admin. Any help would be greatly appreciated.


 name: updateHandler
 class: org.apache.solr.update.DirectUpdateHandler2
 version: 1.0
 description: Update handler that efficiently directly updates the on-disk main lucene index
 stats:
 commits : 0
 optimizes : 0
 docsPending : 6144
 deletesPending : 6144
 adds : 6144
 deletesById : 0
 deletesByQuery : 0
 errors : 0
 cumulative_adds : 6144
 cumulative_deletesById : 0
 cumulative_deletesByQuery : 0
 cumulative_errors : 0
 docsDeleted : 0





Re: Doc add limit

2006-07-26 Thread sangraal aiken

I see the problem on Mac OS X/JDK: 1.5.0_06 and Debian/JDK: 1.5.0_07.

I don't think it's a socket problem, because I can initiate additional
updates while the server is hung... weird I know.

Thanks for all your help, I'll send a post if/when I find a solution.

-S

On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:


Tomcat problem, or a Solr problem that is only manifesting on your
platform, or a JVM or libc problem, or even a client update problem...
(possibly you might be exhausting the number of sockets in the server
by using persistent connections with a long timeout and never reusing
them?)

What is your OS/JVM?

-Yonik

On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Right now the heap is set to 512M but I've increased it up to 2GB and
yet it
 still hangs at the same number 6,144...

 Here's something interesting... I pushed this code over to a different
 server and tried an update. On that server it's hanging on #5,267. Then
 tomcat seems to try to reload the webapp... indefinitely.

 So I guess this is looking more like a tomcat problem more than a
 lucene/solr problem huh?

 -Sangraal

 On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  So it looks like your client is hanging trying to send somethig over
  the socket to the server and blocking... probably because Tomcat isn't
  reading anything from the socket because it's busy trying to restart
  the webapp.
 
  What is the heap size of the server? try increasing it... maybe tomcat
  could have detected low memory and tried to reload the webapp.
 
  -Yonik
 
  On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
    Thanks for your help Yonik, I've responded to your questions below:
  
   On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
   
It's possible it's not hanging, but just takes a long time on a
specific add.  This is because Lucene will occasionally merge
segments.  When very large segments are merged, it can take a long
time.
  
  
   I've left it running (hung) for up to a half hour at a time and I've
   verified that my cpu idles during the hang. I have witnessed much
  shorter
   hangs on the ramp up to my 6,144 limit but they have been more like
2 -
  10
   seconds in length. Perhaps this is the Lucene merging you mentioned.
  
   In the log file, add commands are followed by the number of
milliseconds the operation took.  Next time Solr hangs, wait for a
number of minutes until you see the operation logged and note how
long
it took.
  
  
   Here are the last 5 log entries before the hang the last one is doc
  #6,144.
   Also it looks like Tomcat is trying to redeploy the webapp those
last
  tomcat
   entries repeat indefinitely every 10 seconds or so. Perhaps this is
a
  Tomcat
   problem?
  
   Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
   INFO: add (id=110705) 0 36596
   Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
   INFO: add (id=110700) 0 36600
   Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
   INFO: add (id=110688) 0 36603
   Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
   INFO: add (id=110690) 0 36608
   Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
   INFO: add (id=110686) 0 36611
    Jul 26, 2006 1:25:36 PM org.apache.catalina.startup.HostConfig checkResources
    FINE: Checking context[] redeploy resource /source/solr/apache-tomcat-5.5.17/webapps/ROOT
    Jul 26, 2006 1:25:36 PM org.apache.catalina.startup.HostConfig checkResources
    FINE: Checking context[] redeploy resource /source/solr/apache-tomcat-5.5.17/webapps/ROOT/META-INF/context.xml
    Jul 26, 2006 1:25:36 PM org.apache.catalina.startup.HostConfig checkResources
    FINE: Checking context[] reload resource /source/solr/apache-tomcat-5.5.17/webapps/ROOT/WEB-INF/web.xml
    Jul 26, 2006 1:25:36 PM org.apache.catalina.startup.HostConfig checkResources
    FINE: Checking context[] reload resource /source/solr/apache-tomcat-5.5.17/webapps/ROOT/META-INF/context.xml
    Jul 26, 2006 1:25:36 PM org.apache.catalina.startup.HostConfig checkResources
    FINE: Checking context[] reload resource /source/solr/apache-tomcat-5.5.17/conf/context.xml
  
   How many documents are in the index before you do a batch that
causes
a hang?  Does it happen on the first batch?  If so, you might be
seeing some other bug.  What appserver are you using?  Do the
admin
pages respond when you see this hang?  If so, what does a stack
trace
look like?
  
  
   I actually don't think I had the problem on the first batch, in fact
my
   first batch contained very close to 6,144 documents so perhaps there
is
  a
   relation there. Right now, I'm adding to an index with close to
90,000
   documents in it.
   I'm running Tomcat 5.5.17 and the admin pages respond just fine when
  it's
   hung... I did a thread dump and this is the trace of my update:
  
   http-8080-Processor25 Id=33 in RUNNABLE (running in native) total
cpu
   time

Re: Doc add limit

2006-07-26 Thread sangraal aiken

I removed everything from the Add xml so the docs looked like this:

<doc>
<field name="id">187880</field>
</doc>
<doc>
<field name="id">187852</field>
</doc>

and it still hung at 6,144...

-S


On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:


If you narrow the docs down to just the id field, does it still
happen at the same place?

-Yonik

On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
 I see the problem on Mac OS X/JDK: 1.5.0_06 and Debian/JDK: 1.5.0_07.

 I don't think it's a socket problem, because I can initiate additional
 updates while the server is hung... weird I know.

 Thanks for all your help, I'll send a post if/when I find a solution.

 -S

 On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  Tomcat problem, or a Solr problem that is only manifesting on your
  platform, or a JVM or libc problem, or even a client update problem...
  (possibly you might be exhausting the number of sockets in the server
  by using persistent connections with a long timeout and never reusing
  them?)
 
  What is your OS/JVM?
 
  -Yonik
 
  On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
   Right now the heap is set to 512M but I've increased it up to 2GB
and
  yet it
   still hangs at the same number 6,144...
  
   Here's something interesting... I pushed this code over to a
different
   server and tried an update. On that server it's hanging on #5,267.
Then
   tomcat seems to try to reload the webapp... indefinitely.
  
   So I guess this is looking more like a tomcat problem more than a
   lucene/solr problem huh?
  
   -Sangraal
  
   On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
   
So it looks like your client is hanging trying to send somethig
over
the socket to the server and blocking... probably because Tomcat
isn't
reading anything from the socket because it's busy trying to
restart
the webapp.
   
What is the heap size of the server? try increasing it... maybe
tomcat
could have detected low memory and tried to reload the webapp.
   
-Yonik
   
On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Thanks for you help Yonik, I've responded to your questions
below:

 On 7/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  It's possible it's not hanging, but just takes a long time on
a
  specific add.  This is because Lucene will occasionally merge
  segments.  When very large segments are merged, it can take a
long
  time.


 I've left it running (hung) for up to a half hour at a time and
I've
 verified that my cpu idles during the hang. I have witnessed
much
shorter
 hangs on the ramp up to my 6,144 limit but they have been more
like
  2 -
10
 seconds in length. Perhaps this is the Lucene merging you
mentioned.

 In the log file, add commands are followed by the number of
  milliseconds the operation took.  Next time Solr hangs, wait
for a
  number of minutes until you see the operation logged and note
how
  long
  it took.


 Here are the last 5 log entries before the hang the last one is
doc
#6,144.
 Also it looks like Tomcat is trying to redeploy the webapp those
  last
tomcat
 entries repeat indefinitely every 10 seconds or so. Perhaps this
is
  a
Tomcat
 problem?

 Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
 INFO: add (id=110705) 0 36596
 Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
 INFO: add (id=110700) 0 36600
 Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
 INFO: add (id=110688) 0 36603
 Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
 INFO: add (id=110690) 0 36608
 Jul 26, 2006 1:25:28 PM org.apache.solr.core.SolrCore update
 INFO: add (id=110686) 0 36611
 Jul 26, 2006 1:25:36 PM
org.apache.catalina.startup.HostConfigcheckResources
 FINE: Checking context[] redeploy resource /source/solr/apache-
tomcat-5.5.17
 /webapps/ROOT
 Jul 26, 2006 1:25:36 PM
org.apache.catalina.startup.HostConfigcheckResources
 FINE: Checking context[] redeploy resource /source/solr/apache-
tomcat-5.5.17
 /webapps/ROOT/META-INF/context.xml
 Jul 26, 2006 1:25:36 PM
org.apache.catalina.startup.HostConfigcheckResources
 FINE: Checking context[] reload resource /source/solr/apache-
tomcat-5.5.17
 /webapps/ROOT/WEB-INF/web.xml
 Jul 26, 2006 1:25:36 PM
org.apache.catalina.startup.HostConfigcheckResources
 FINE: Checking context[] reload resource /source/solr/apache-
tomcat-5.5.17
 /webapps/ROOT/META-INF/context.xml
 Jul 26, 2006 1:25:36 PM
org.apache.catalina.startup.HostConfigcheckResources
 FINE: Checking context[] reload resource /source/solr/apache-
tomcat-5.5.17
 /conf/context.xml

 How many documents are in the index before you do a batch that
  causes
  a hang?  Does it happen on the first batch?  If so, you might
be
  seeing some other bug.  What appserver