Re: Own Similarity Class in Solr
: I would like to alter the similarity behaviour of Solr to remove
: the fieldNorm factor in the similarity calculations. As far as I
: read, I need to create my own Similarity class and import it into
: Solr using the similarity config in schema.xml.
:
: Has anybody already tweaked or played with this topic, and might
: give me some code or advice?

As you've already noticed, you can specify the Similarity class at runtime via the schema.xml -- the only Solr-specific aspect of this is making sure your Similarity class is in your servlet container's classpath (exactly how you do this depends on your servlet container).

Searching the java-dev and java-user Lucene mailing lists is the best bet for finding discussions on writing your own Similarity. There are also some examples in the main Lucene code base...

  contrib/miscellaneous/src/java/org/apache/lucene/misc/SweetSpotSimilarity.java
  src/test/org/apache/lucene/search/TestDisjunctionMaxQuery.java

...if your main interest is just eliminating norms, there is a special option for that in Lucene Fields called "omit norms" (it not only eliminates the effect of norms on scoring, it saves space in your index as well). In Solr you can turn it on/off per field or fieldtype using the omitNorms="true" option in the schema.xml.

-Hoss
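For reference, the schema.xml hookup Hoss describes might look like the sketch below. The field, type, and class names here are illustrative, not taken from the thread; check your own schema:

```xml
<!-- schema.xml sketch: omitNorms can be set per fieldtype... -->
<types>
  <fieldtype name="text_nonorm" class="solr.TextField" omitNorms="true"/>
</types>

<!-- ...or per field -->
<fields>
  <field name="searchname" type="text" omitNorms="true"/>
</fields>

<!-- a custom Similarity, if you go that route instead -->
<similarity class="com.example.MySimilarity"/>
```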
Re: Own Similarity Class in Solr
Hi Chris, thanks for the details. I am meanwhile poking around with my own class, which I defined in the schema.xml, and everything is working perfectly there. But I still have the problem with the normalization: I tried changing several parameters to fix it to 1.0, and this does indeed change the scoring, but still not in the way I need. It seems that it is always the fieldNorm which is playing a role, but where does this value really come from? In the Similarity class I don't find this term to alter.

Let me give a short example of what goes wrong: I have a field searchname with a boost of 3.0 during the document.add. Another field text is a copyField of several entries; this one does not have a boost factor, but it does have more data in it. In this text is a copy of a field where the searched text occurs 3 times. This entry has the score 5.5930133. But I also have entries where the searchname has the same word in it, and those have a score of 1.9975047.

Currently my class is like this (I took the DefaultSimilarity as a basis):
- lengthNorm is fixed to 1.0
- tf fixed to 1.0
- idf fixed to 1.0

With these changes, might it be possible that I've deactivated the boost on the different fields? What I need is a search which treats each document the same, regardless of term frequency and field size; it shall calculate the score only from the boost factors, so a document with a high boost factor and the same text in it as another one with a lower factor shall rank before the others. Something I do might be completely wrong -- perhaps you have an idea? Thanks, Tom
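One detail that may explain the interaction Tom is seeing: in the Lucene of this era, the index-time field boost is multiplied into the same single-byte norm as the lengthNorm, so the two are not independent (and the one-byte encoding makes boosts approximate). The arithmetic below is a schematic sketch of the scoring product, not Lucene code; the numbers are invented to illustrate why pinning tf, idf and lengthNorm to 1.0 leaves only the boosts to differentiate documents:

```python
# Schematic sketch of (2006-era) Lucene per-term scoring -- not Lucene code.
# Roughly: score ~ tf * idf^2 * boost * norm, where the stored norm is
# lengthNorm * index-time field boost, encoded in a single byte.

def term_score(tf, idf, boost, length_norm):
    # flat-scoring idea: pin tf, idf and length_norm to 1.0 so that
    # only the boost differentiates documents
    return tf * (idf ** 2) * boost * length_norm

# unboosted but long, term-repeating "text" field vs. boosted "searchname"
text_field = term_score(tf=3.0, idf=1.2, boost=1.0, length_norm=0.5)
name_field = term_score(tf=1.0, idf=1.2, boost=3.0, length_norm=0.25)

# with tf/idf/lengthNorm fixed to 1.0, only the boost remains
flat_text = term_score(tf=1.0, idf=1.0, boost=1.0, length_norm=1.0)
flat_name = term_score(tf=1.0, idf=1.0, boost=3.0, length_norm=1.0)

assert text_field > name_field   # repetition can beat the boost
assert flat_name > flat_text     # flat scoring: the boost always wins
```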
add/update index
Hi, I have created a process which uses xsl to convert my data to the form indicated in the examples so that it can be added to the index as the solr tutorial indicates:

  <add>
    <doc>
      <field name="field">value</field>
      ...
    </doc>
  </add>

In some cases the xsl process will create a field element with no data (ie <field name="field"/>). Is this considered bad input and will not be accepted? Or is this something that solr should deal with? Currently for each field element with no data I receive the message:

  <result status="1">java.lang.NullPointerException
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:78)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:74)
    at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:917)
    at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
    ...
  </result>

Just curious if the gurus out there think I should deal with the null values in my xsl process or if this can be dealt with in solr itself? Thanks, Tricia

ps. Thanks for the timely fix for the UTF-8 issue!
Re: add/update index
Thanks Yonik, That's exactly what I needed to know. I'll adapt my xsl process to omit null values. Tricia

On Thu, 27 Jul 2006, Yonik Seeley wrote:

 On 7/27/06, Tricia Williams [EMAIL PROTECTED] wrote:
  Hi, I have created a process which uses xsl to convert my data to the form indicated in the examples so that it can be added to the index as the solr tutorial indicates:

    <add>
      <doc>
        <field name="field">value</field>
        ...
      </doc>
    </add>

  In some cases the xsl process will create a field element with no data (ie <field name="field"/>). Is this considered bad input and will not be accepted?

 If the desired semantics are "the field doesn't exist" or "null value" then yes. There isn't a way to represent a field without a value in Lucene except to not add the field for that document. If it's totally ignored, it probably shouldn't be in the XML.

 Now, one might think we could drop fields with no value, but that's problematic because it goes against the XML standard:
 http://www.w3.org/TR/REC-xml/#sec-starttags

 [Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. [Definition: An empty-element tag takes a special form:]

 So <a></a> and <a/> are supposed to be equivalent. Given that, it does look like Solr should treat <field name="val"/> like a zero-length string (but that's not what you wanted, right?)

 -Yonik
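The fix Tricia describes -- omitting null values in the xsl process -- might look something like the sketch below. The source element and attribute names are hypothetical; only the `<field>` output element comes from the thread:

```xml
<!-- XSLT sketch (source names "entry"/"value" are illustrative):
     emit a <field> only when the source value is non-empty -->
<xsl:template match="entry">
  <xsl:if test="string-length(normalize-space(value)) &gt; 0">
    <field name="{@name}"><xsl:value-of select="value"/></field>
  </xsl:if>
</xsl:template>
```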
Solr's JSON, Python, Ruby output format
Solr now has a JSON response format, in addition to Python and Ruby versions that can be directly eval'd. http://wiki.apache.org/solr/SolJSON -Yonik
Re: Doc add limit
On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
 I removed everything from the Add xml so the docs looked like this:
  <doc><field name="id">187880</field></doc>
  <doc><field name="id">187852</field></doc>
 and it still hung at 6,144...

Maybe you can try the following simple Python client to try and rule out some kind of different client interactions... the attached script adds 10,000 documents and works fine for me in WinXP w/ Tomcat 5.5.17 and Jetty

-Yonik

solr.py
--
import httplib
import socket

class SolrConnection:
  def __init__(self, host='localhost:8983', solrBase='/solr'):
    self.host = host
    self.solrBase = solrBase
    # a connection to the server is not opened at this point.
    self.conn = httplib.HTTPConnection(self.host)
    # self.conn.set_debuglevel(100)
    self.postheaders = {"Connection": "close"}

  def doUpdateXML(self, request):
    try:
      self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
    except (socket.error, httplib.CannotSendRequest):
      # reconnect in case the connection was broken from the server going down,
      # the server timing out our persistent connection, or another
      # network failure.
      # Also catch httplib.CannotSendRequest because the HTTPConnection object
      # can get in a bad state.
      self.conn.close()
      self.conn.connect()
      self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
    rsp = self.conn.getresponse()
    # print rsp.status, rsp.reason
    data = rsp.read()
    # print "data=", data
    self.conn.close()

  def delete(self, id):
    xstr = '<delete><id>'+id+'</id></delete>'
    self.doUpdateXML(xstr)

  def add(self, **fields):
    # todo: XML escaping
    flist = ['<field name="%s">%s</field>' % f for f in fields.items()]
    flist.insert(0, '<add><doc>')
    flist.append('</doc></add>')
    xstr = ''.join(flist)
    self.doUpdateXML(xstr)

c = SolrConnection()
#for i in range(10000):
#  c.delete(str(i))
for i in range(10000):
  c.add(id=i)
Re: Doc add limit
On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 class SolrConnection:
   def __init__(self, host='localhost:8983', solrBase='/solr'):
     self.host = host
     self.solrBase = solrBase
     # a connection to the server is not opened at this point.
     self.conn = httplib.HTTPConnection(self.host)
     # self.conn.set_debuglevel(100)
     self.postheaders = {"Connection": "close"}

   def doUpdateXML(self, request):
     try:
       self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)

Digressive note: I'm not sure if it is necessary with Tomcat, but in my experience driving Solr with Python using Jetty, it was necessary to specify the content-type when posting utf-8 data:

  self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})

-Mike
Re: Doc add limit
Hi Sangraal: Sorry--I didn't mean to imply that this might affect your issue. You may have to crank up the solr logging to determine where it is freezing (and what might be happening).

It is certainly worth investigating why this occurs, but I wonder about the advantages of using such huge batches. Assuming a few hundred bytes per document, 6100 docs produces a POST over 1MB in size.

-Mike

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Mike, I've been posting with the content type set like this:

   conn.setRequestProperty("Content-Type", "application/octet-stream");

 I tried your suggestion though, and unfortunately there was no change.

   conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

 -Sangraal

 On 7/27/06, Mike Klaas [EMAIL PROTECTED] wrote:
  On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
   class SolrConnection:
     def __init__(self, host='localhost:8983', solrBase='/solr'):
       self.host = host
       self.solrBase = solrBase
       # a connection to the server is not opened at this point.
       self.conn = httplib.HTTPConnection(self.host)
       # self.conn.set_debuglevel(100)
       self.postheaders = {"Connection": "close"}

     def doUpdateXML(self, request):
       try:
         self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)

  Digressive note: I'm not sure if it is necessary with Tomcat, but in my experience driving Solr with Python using Jetty, it was necessary to specify the content-type when posting utf-8 data:

    self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})

  -Mike
Re: Doc add limit
Yeah, I'm closing them. Here's the method:
-
  private String doUpdate(String sw) {
    StringBuffer updateResult = new StringBuffer();
    try {
      // open connection
      log.info("Connecting to and preparing to post to SolrUpdate servlet.");
      URL url = new URL("http://localhost:8080/update");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Content-Type", "application/octet-stream");
      conn.setDoOutput(true);
      conn.setDoInput(true);
      conn.setUseCaches(false);

      // Write to server
      log.info("About to post to SolrUpdate servlet.");
      DataOutputStream output = new DataOutputStream(conn.getOutputStream());
      output.writeBytes(sw);
      output.flush();
      output.close();
      log.info("Finished posting to SolrUpdate servlet.");

      // Read response
      log.info("Ready to read response.");
      BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
      log.info("Got reader");
      String line;
      while ((line = rd.readLine()) != null) {
        log.info("Writing to result...");
        updateResult.append(line);
      }
      rd.close();

      // close connections
      conn.disconnect();
      log.info("Done updating Solr for site " + updateSite);
    } catch (Exception e) {
      e.printStackTrace();
    }
    return updateResult.toString();
  }
}

-Sangraal

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 Are you reading the response and closing the connection? If not, you are probably running out of socket connections.
 -Yonik

 On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
  Yonik, It looks like the problem is with the way I'm posting to the SolrUpdate servlet. I am able to use curl to post the data to my tomcat instance without a problem. It only fails when I try to handle the http post from java...
  my code is below:

    URL url = new URL("http://localhost:8983/solr/update");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/octet-stream");
    conn.setDoOutput(true);
    conn.setDoInput(true);
    conn.setUseCaches(false);

    // Write to server
    log.info("About to post to SolrUpdate servlet.");
    DataOutputStream output = new DataOutputStream(conn.getOutputStream());
    output.writeBytes(sw);
    output.flush();
    log.info("Finished posting to SolrUpdate servlet.");

  -Sangraal

  On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
   On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
    I removed everything from the Add xml so the docs looked like this:
     <doc><field name="id">187880</field></doc>
     <doc><field name="id">187852</field></doc>
    and it still hung at 6,144...

   Maybe you can try the following simple Python client to try and rule out some kind of different client interactions... the attached script adds 10,000 documents and works fine for me in WinXP w/ Tomcat 5.5.17 and Jetty

   -Yonik

   solr.py
   --
   import httplib
   import socket

   class SolrConnection:
     def __init__(self, host='localhost:8983', solrBase='/solr'):
       self.host = host
       self.solrBase = solrBase
       # a connection to the server is not opened at this point.
       self.conn = httplib.HTTPConnection(self.host)
       # self.conn.set_debuglevel(100)
       self.postheaders = {"Connection": "close"}

     def doUpdateXML(self, request):
       try:
         self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
       except (socket.error, httplib.CannotSendRequest):
         # reconnect in case the connection was broken from the server going down,
         # the server timing out our persistent connection, or another
         # network failure.
         # Also catch httplib.CannotSendRequest because the HTTPConnection object
         # can get in a bad state.
         self.conn.close()
         self.conn.connect()
         self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)
       rsp = self.conn.getresponse()
       # print rsp.status, rsp.reason
       data = rsp.read()
       # print "data=", data
       self.conn.close()

     def delete(self, id):
       xstr = '<delete><id>'+id+'</id></delete>'
       self.doUpdateXML(xstr)

     def add(self, **fields):
       # todo: XML escaping
       flist = ['<field name="%s">%s</field>' % f for f in fields.items()]
       flist.insert(0, '<add><doc>')
       flist.append('</doc></add>')
       xstr = ''.join(flist)
       self.doUpdateXML(xstr)

   c = SolrConnection()
   #for i in range(10000):
   #  c.delete(str(i))
   for i in range(10000):
     c.add(id=i)
Re: Doc add limit
I haven't been following the thread, but... Not sure if you are using Tomcat or Jetty, but Jetty has a POST size limit (set somewhere in its configs) that may be the source of the problem.

Otis

P.S. Just occurred to me. Tomcat. Jetty. Tom & Jerry! The Jetty guys should have called their thing Jerry or Jerrymouse.

- Original Message
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, July 27, 2006 6:33:16 PM
Subject: Re: Doc add limit

Hi Sangraal: Sorry--I didn't mean to imply that this might affect your issue. You may have to crank up the solr logging to determine where it is freezing (and what might be happening).

It is certainly worth investigating why this occurs, but I wonder about the advantages of using such huge batches. Assuming a few hundred bytes per document, 6100 docs produces a POST over 1MB in size.

-Mike

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Mike, I've been posting with the content type set like this:

   conn.setRequestProperty("Content-Type", "application/octet-stream");

 I tried your suggestion though, and unfortunately there was no change.

   conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

 -Sangraal

 On 7/27/06, Mike Klaas [EMAIL PROTECTED] wrote:
  On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
   class SolrConnection:
     def __init__(self, host='localhost:8983', solrBase='/solr'):
       self.host = host
       self.solrBase = solrBase
       # a connection to the server is not opened at this point.
       self.conn = httplib.HTTPConnection(self.host)
       # self.conn.set_debuglevel(100)
       self.postheaders = {"Connection": "close"}

     def doUpdateXML(self, request):
       try:
         self.conn.request('POST', self.solrBase+'/update', request, self.postheaders)

  Digressive note: I'm not sure if it is necessary with Tomcat, but in my experience driving Solr with Python using Jetty, it was necessary to specify the content-type when posting utf-8 data:

    self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})

  -Mike
Re: Doc add limit
I'm running on Tomcat... and I've verified that the complete post is making it through the SolrUpdate servlet and into the SolrCore object... thanks for the info though.
--
So the code is hanging on this call in SolrCore.java:

  writer.write("<result status=\"" + status + "\"></result>");

The thread dump:

http-8080-Processor24 Id=32 in RUNNABLE (running in native) total cpu time=40698.0440ms user time=38646.1680ms
  at java.net.SocketOutputStream.socketWrite0(Native Method)
  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
  at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
  at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:746)
  at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:433)
  at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:348)
  at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:769)
  at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:125)
  at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:579)
  at org.apache.coyote.Response.doWrite(Response.java:559)
  at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:361)
  at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:324)
  at org.apache.tomcat.util.buf.IntermediateOutputStream.write(C2BConverter.java:235)
  at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
  at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)
  at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
  at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
  at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
  at org.apache.tomcat.util.buf.WriteConvertor.flush(C2BConverter.java:184)
  at org.apache.tomcat.util.buf.C2BConverter.flushBuffer(C2BConverter.java:127)
  at org.apache.catalina.connector.OutputBuffer.realWriteChars(OutputBuffer.java:536)
  at org.apache.tomcat.util.buf.CharChunk.flushBuffer(CharChunk.java:439)
  at org.apache.tomcat.util.buf.CharChunk.append(CharChunk.java:370)
  at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:491)
  at org.apache.catalina.connector.CoyoteWriter.write(CoyoteWriter.java:161)
  at org.apache.catalina.connector.CoyoteWriter.write(CoyoteWriter.java:170)
  at org.apache.solr.core.SolrCore.update(SolrCore.java:695)
  at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
  at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
  at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
  at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
  at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
  at java.lang.Thread.run(Thread.java:613)

On 7/27/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 I haven't been following the thread, but... Not sure if you are using Tomcat
 or Jetty, but Jetty has a POST size limit (set somewhere in its configs) that may be the source of the problem.

 Otis

 P.S. Just occurred to me. Tomcat. Jetty. Tom & Jerry! The Jetty guys should have called their thing Jerry or Jerrymouse.

 - Original Message
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 27, 2006 6:33:16 PM
 Subject: Re: Doc add limit

 Hi Sangraal: Sorry--I didn't mean to imply that this might affect your issue. You may have to crank up the solr logging to determine where it is freezing (and what might be happening).

 It is certainly worth investigating why this occurs, but I wonder about the advantages of using such huge batches. Assuming a few hundred bytes per document, 6100 docs produces a POST over 1MB in size.

 -Mike

 On
Re: Doc add limit
You might also try the Java update client here: http://issues.apache.org/jira/browse/SOLR-20 -Yonik
Re: Doc add limit
On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Commenting out the following line in SolrCore fixes my problem... but of course I don't get the result status info... but this isn't a problem for me really. -Sangraal

   writer.write("<result status=\"" + status + "\"></result>");

While it's possible you hit a Tomcat bug, I think it's more likely a client problem.

-Yonik
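One client-side failure mode worth ruling out, consistent with the thread dump showing the server stuck in socketWrite: if the client never drains the HTTP response, the server's write into the socket can block once TCP buffers fill, which from the outside looks like a server hang. A minimal sketch of a client that always reads the response and closes, written with the modern http.client module rather than the thread's httplib (the /solr/update path is from the thread; everything else is illustrative):

```python
import http.client

def post_update(host, port, body):
    """POST an update and always drain the response.

    If the response is never read, the server's write can block once
    socket buffers fill -- appearing as a hang mid-request.
    (Sketch only; not the code from the thread.)
    """
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request('POST', '/solr/update', body,
                     {'Content-Type': 'text/xml; charset=utf-8'})
        rsp = conn.getresponse()
        data = rsp.read()  # drain fully before closing or reusing
        return rsp.status, data
    finally:
        conn.close()
```

Reading to EOF before closing also lets the connection be reused cleanly if keep-alive is in play.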
Re: Doc add limit
I'll give that a shot... Thanks again for all your help. -S On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote: You might also try the Java update client here: http://issues.apache.org/jira/browse/SOLR-20 -Yonik