Re: Own Similarity Class in Solr

2006-07-27 Thread Chris Hostetter

: I would like to alter the similarity behaviour of Solr to remove
: the fieldNorm factor in the similarity calculations. As far as I
: read, I need to create my own Similarity class and import it into
: Solr using the similarity config in schema.xml.
:
: Has anybody already tweaked or played with this topic, and might
: give me some code or advice?

as you've already noticed, you can specify the Similarity class at runtime
via the schema.xml -- the only Solr-specific aspect of this is making sure
your Similarity class is in your servlet container's classpath (exactly how
you do this depends on your servlet container)
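
for reference, the schema.xml declaration looks roughly like this (the
class name below is just a placeholder for your own class):

  <similarity class="org.example.MySimilarity"/>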

searching the java-dev and java-user Lucene mailing lists is the best bet
for finding discussions on writing your own Similarity; there are also
some examples in the main Lucene code base...

contrib/miscellaneous/src/java/org/apache/lucene/misc/SweetSpotSimilarity.java
src/test/org/apache/lucene/search/TestDisjunctionMaxQuery.java

...if your main interest is just eliminating norms, there is a special
option for that on Lucene Fields called "omit norms" -- it not only
eliminates the effect of norms on scoring, it saves space in your
index as well. In Solr you can turn it on/off per field or field type
using the omitNorms="true" option in the schema.xml.
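
for example, a field declaration might look like this (the field and type
names here are illustrative, not from your schema):

  <field name="searchname" type="text" indexed="true" stored="true"
         omitNorms="true"/>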



-Hoss



Re: Own Similarity Class in Solr

2006-07-27 Thread Tom Weber

Hi Chris,

  thanks for the details. Meanwhile I am poking around with my own
class, which I defined in the schema.xml; everything is working
perfectly there.


  But I still have a problem with the normalization. I have tried
changing several parameters to fix it to 1.0; this does indeed change
the scoring, but still not in the way I really need. It seems that it
is always the fieldNorm which comes into play, but where does this
factor really come from? In the Similarity class I don't find this
term to alter.


  Let me give a short example of what goes wrong:

  I have a field "searchname" with a boost of 3.0 applied during the
document.add. Another field "text" is a copyField of several entries;
this one does not have a boost factor, but it does hold more data. In
this "text" is a copy of a field in which the searched text occurs
3 times. This entry has the score 5.5930133.


  But I also have entries where the "searchname" contains the same
word, and these have a score of only 1.9975047.


  Currently my class is like this (I took the DefaultSimilarity as a
basis):


  - lengthNorm is fixed to 1.0
  - tf fixed to 1.0
  - idf fixed to 1.0
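
  For reference, a minimal sketch of what such a class presumably looks
like, based on the DefaultSimilarity of that Lucene era (the class name
is hypothetical):

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatSimilarity extends DefaultSimilarity {
      // no length normalization: long and short fields are scored alike
      public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
      }
      // no term-frequency contribution: one occurrence counts like three
      public float tf(float freq) {
        return 1.0f;
      }
      // no rare-term weighting
      public float idf(int docFreq, int numDocs) {
        return 1.0f;
      }
    }

  One caveat: lengthNorm is applied at indexing time and stored in the
index (folded together with the index-time field boost into the norm), so
a changed lengthNorm only affects documents indexed after the change.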

  With these changes, might it be that I've deactivated the boost on
the different fields?


  What I need is a search which treats each document the same,
regardless of term frequency and document size; it should calculate
the score only from the boost factors, so that a document with a high
boost factor and the same text in it as another one with a lower
factor ranks before the others.


  Something I do might be completely wrong; perhaps you have an idea?

  Thanks,

   Tom


add/update index

2006-07-27 Thread Tricia Williams

Hi,

   I have created a process which uses xsl to convert my data to the form
shown in the examples, so that it can be added to the index as the solr
tutorial describes:

<add>
  <doc>
    <field name="field">value</field>
    ...
  </doc>
</add>

   In some cases the xsl process will create a field element with no data
(i.e. <field name="field"/>).  Is this considered bad input that will not
be accepted?  Or is this something that solr should deal with?  Currently
for each field element with no data I receive the message:

<result status="1">java.lang.NullPointerException
 at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:78)
 at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:74)
 at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:917)
 at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
 at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
 ...
</result>

   Just curious if the gurus out there think I should deal with the null 
values in my xsl process or if this can be dealt with in solr itself?


Thanks,
Tricia

ps.  Thanks for the timely fix for the UTF-8 issue!


Re: add/update index

2006-07-27 Thread Tricia Williams

Thanks Yonik,

   That's exactly what I needed to know.  I'll adapt my xsl process to 
omit null values.
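
   For instance, a minimal XSLT 1.0 guard along these lines (element and
field names are illustrative) emits a field only when the source value is
non-empty:

     <xsl:if test="string-length(normalize-space(.)) &gt; 0">
       <field name="title"><xsl:value-of select="."/></field>
     </xsl:if>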


Tricia

On Thu, 27 Jul 2006, Yonik Seeley wrote:


On 7/27/06, Tricia Williams [EMAIL PROTECTED] wrote:

Hi,

I have created a process which uses xsl to convert my data to the form
shown in the examples, so that it can be added to the index as the solr
tutorial describes:
<add>
   <doc>
     <field name="field">value</field>
     ...
   </doc>
</add>

In some cases the xsl process will create a field element with no data
(i.e. <field name="field"/>).  Is this considered bad input that will not
be accepted?


If the desired semantics are "the field doesn't exist" or "null value",
then yes.  There isn't a way to represent a field without a value in
Lucene except to not add the field for that document.  If it's totally
ignored, it probably shouldn't be in the XML.

Now, one might think we could drop fields with no value, but that's
problematic because it goes against the XML standard:

http://www.w3.org/TR/REC-xml/#sec-starttags
[Definition: An element with no content is said to be empty.] The
representation of an empty element is either a start-tag immediately
followed by an end-tag, or an empty-element tag. [Definition: An
empty-element tag takes a special form:]

So <a></a> and <a/> are supposed to be equivalent.  Given that, it
does look like Solr should treat <field name="val"/> like a
zero-length string (but that's not what you wanted, right?)

-Yonik



Solr's JSON, Python, Ruby output format

2006-07-27 Thread Yonik Seeley

Solr now has a JSON response format, in addition to Python and Ruby
versions that can be directly eval'd.

http://wiki.apache.org/solr/SolJSON
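
The response writer is selected with the wt request parameter; for
example, against the stock example server (query and port are assumptions):

  http://localhost:8983/solr/select?q=solr&wt=json

wt=python and wt=ruby produce the directly eval'able variants.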

-Yonik


Re: Doc add limit

2006-07-27 Thread Yonik Seeley

On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:

I removed everything from the Add xml so the docs looked like this:

<doc>
<field name="id">187880</field>
</doc>
<doc>
<field name="id">187852</field>
</doc>

and it still hung at 6,144...


Maybe you can try the following simple Python client to rule out some
kind of client interaction issue... the attached script adds 10,000
documents and works fine for me on WinXP w/ Tomcat 5.5.17 and Jetty.

-Yonik


-------- solr.py --------
import httplib
import socket

class SolrConnection:
  def __init__(self, host='localhost:8983', solrBase='/solr'):
    self.host = host
    self.solrBase = solrBase
    #a connection to the server is not opened at this point.
    self.conn = httplib.HTTPConnection(self.host)
    #self.conn.set_debuglevel(100)
    self.postheaders = {"Connection": "close"}

  def doUpdateXML(self, request):
    try:
      self.conn.request('POST', self.solrBase+'/update', request,
                        self.postheaders)
    except (socket.error, httplib.CannotSendRequest):
      #reconnect in case the connection was broken from the server going down,
      #the server timing out our persistent connection, or another
      #network failure.
      #Also catch httplib.CannotSendRequest because the HTTPConnection object
      #can get in a bad state.
      self.conn.close()
      self.conn.connect()
      self.conn.request('POST', self.solrBase+'/update', request,
                        self.postheaders)

    rsp = self.conn.getresponse()
    #print rsp.status, rsp.reason
    data = rsp.read()
    #print "data=", data
    self.conn.close()

  def delete(self, id):
    xstr = '<delete><id>' + id + '</id></delete>'
    self.doUpdateXML(xstr)

  def add(self, **fields):
    #todo: XML escaping
    flist = ['<field name="%s">%s</field>' % f for f in fields.items()]
    flist.insert(0, '<add><doc>')
    flist.append('</doc></add>')
    xstr = ''.join(flist)
    self.doUpdateXML(xstr)

c = SolrConnection()
#for i in range(10000):
#  c.delete(str(i))
for i in range(10000):
  c.add(id=i)


Re: Doc add limit

2006-07-27 Thread Mike Klaas

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:


class SolrConnection:
  def __init__(self, host='localhost:8983', solrBase='/solr'):
    self.host = host
    self.solrBase = solrBase
    #a connection to the server is not opened at this point.
    self.conn = httplib.HTTPConnection(self.host)
    #self.conn.set_debuglevel(100)
    self.postheaders = {"Connection": "close"}

  def doUpdateXML(self, request):
    try:
      self.conn.request('POST', self.solrBase+'/update', request,
                        self.postheaders)


Digressive note: I'm not sure if it is necessary with Tomcat, but in
my experience driving Solr from Python using Jetty, it was necessary
to specify the content-type when posting utf-8 data:

self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})

-Mike


Re: Doc add limit

2006-07-27 Thread Mike Klaas

Hi Sangraal:

Sorry--I tried not to imply that this might affect your issue.  You
may have to crank up the solr logging to determine where it is
freezing (and what might be happening).

It is certainly worth investigating why this occurs, but I wonder
about the advantages of using such huge batches.  Assuming a few
hundred bytes per document, 6100 docs produces a POST over 1MB in
size.

-Mike

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:

Mike,
 I've been posting with the content type set like this:
  conn.setRequestProperty("Content-Type", "application/octet-stream");

I tried your suggestion though, and unfortunately there was no change.
  conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

-Sangraal


On 7/27/06, Mike Klaas [EMAIL PROTECTED] wrote:

 On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:

  class SolrConnection:
    def __init__(self, host='localhost:8983', solrBase='/solr'):
      self.host = host
      self.solrBase = solrBase
      #a connection to the server is not opened at this point.
      self.conn = httplib.HTTPConnection(self.host)
      #self.conn.set_debuglevel(100)
      self.postheaders = {"Connection": "close"}

    def doUpdateXML(self, request):
      try:
        self.conn.request('POST', self.solrBase+'/update', request,
                          self.postheaders)

 Digressive note: I'm not sure if it is necessary with Tomcat, but in
 my experience driving Solr from Python using Jetty, it was necessary
 to specify the content-type when posting utf-8 data:

 self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})

 -Mike





Re: Doc add limit

2006-07-27 Thread sangraal aiken

Yeah, I'm closing them.  Here's the method:

-
 private String doUpdate(String sw) {
   StringBuffer updateResult = new StringBuffer();
   try {
     // open connection
     log.info("Connecting to and preparing to post to SolrUpdate servlet.");
     URL url = new URL("http://localhost:8080/update");
     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
     conn.setRequestMethod("POST");
     conn.setRequestProperty("Content-Type", "application/octet-stream");
     conn.setDoOutput(true);
     conn.setDoInput(true);
     conn.setUseCaches(false);

     // Write to server
     log.info("About to post to SolrUpdate servlet.");
     DataOutputStream output = new DataOutputStream(conn.getOutputStream());
     output.writeBytes(sw);
     output.flush();
     output.close();
     log.info("Finished posting to SolrUpdate servlet.");

     // Read response
     log.info("Ready to read response.");
     BufferedReader rd = new BufferedReader(new InputStreamReader(
         conn.getInputStream()));
     log.info("Got reader");
     String line;
     while ((line = rd.readLine()) != null) {
       log.info("Writing to result...");
       updateResult.append(line);
     }
     rd.close();

     // close connections
     conn.disconnect();

     log.info("Done updating Solr for site " + updateSite);
   } catch (Exception e) {
     e.printStackTrace();
   }

   return updateResult.toString();
 }

-Sangraal

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:


Are you reading the response and closing the connection?  If not, you
are probably running out of socket connections.

-Yonik

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Yonik,
 It looks like the problem is with the way I'm posting to the SolrUpdate
 servlet. I am able to use curl to post the data to my tomcat instance
 without a problem. It only fails when I try to handle the http post from
 java... my code is below:

   URL url = new URL("http://localhost:8983/solr/update");
   HttpURLConnection conn = (HttpURLConnection) url.openConnection();
   conn.setRequestMethod("POST");
   conn.setRequestProperty("Content-Type", "application/octet-stream");
   conn.setDoOutput(true);
   conn.setDoInput(true);
   conn.setUseCaches(false);

   // Write to server
   log.info("About to post to SolrUpdate servlet.");
   DataOutputStream output = new DataOutputStream(conn.getOutputStream());
   output.writeBytes(sw);
   output.flush();
   log.info("Finished posting to SolrUpdate servlet.");

 -Sangraal

 On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  On 7/26/06, sangraal aiken [EMAIL PROTECTED] wrote:
   I removed everything from the Add xml so the docs looked like this:
  
   <doc>
   <field name="id">187880</field>
   </doc>
   <doc>
   <field name="id">187852</field>
   </doc>
  
   and it still hung at 6,144...
 
  Maybe you can try the following simple Python client to rule out some
  kind of client interaction issue... the attached script adds 10,000
  documents and works fine for me on WinXP w/ Tomcat 5.5.17 and Jetty.
 
  -Yonik
 
 
  -------- solr.py --------
  import httplib
  import socket

  class SolrConnection:
    def __init__(self, host='localhost:8983', solrBase='/solr'):
      self.host = host
      self.solrBase = solrBase
      #a connection to the server is not opened at this point.
      self.conn = httplib.HTTPConnection(self.host)
      #self.conn.set_debuglevel(100)
      self.postheaders = {"Connection": "close"}

    def doUpdateXML(self, request):
      try:
        self.conn.request('POST', self.solrBase+'/update', request,
                          self.postheaders)
      except (socket.error, httplib.CannotSendRequest):
        #reconnect in case the connection was broken from the server going down,
        #the server timing out our persistent connection, or another
        #network failure.
        #Also catch httplib.CannotSendRequest because the HTTPConnection object
        #can get in a bad state.
        self.conn.close()
        self.conn.connect()
        self.conn.request('POST', self.solrBase+'/update', request,
                          self.postheaders)

      rsp = self.conn.getresponse()
      #print rsp.status, rsp.reason
      data = rsp.read()
      #print "data=", data
      self.conn.close()

    def delete(self, id):
      xstr = '<delete><id>' + id + '</id></delete>'
      self.doUpdateXML(xstr)

    def add(self, **fields):
      #todo: XML escaping
      flist = ['<field name="%s">%s</field>' % f for f in fields.items()]
      flist.insert(0, '<add><doc>')
      flist.append('</doc></add>')
      xstr = ''.join(flist)
      self.doUpdateXML(xstr)

  c = SolrConnection()
  #for i in range(10000):
  #  c.delete(str(i))
  for i in range(10000):
    c.add(id=i)



Re: Doc add limit

2006-07-27 Thread Otis Gospodnetic
I haven't been following the thread, but
Not sure if you are using Tomcat or Jetty, but Jetty has a POST size limit (set 
somewhere in its configs) that may be the source of the problem.

Otis
P.S.
Just occurred to me.
Tomcat.  Jetty.  Tom & Jerry.  The Jetty guys should have called their thing
Jerry or Jerrymouse.

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, July 27, 2006 6:33:16 PM
Subject: Re: Doc add limit

Hi Sangraal:

Sorry--I tried not to imply that this might affect your issue.  You
may have to crank up the solr logging to determine where it is
freezing (and what might be happening).

It is certainly worth investigating why this occurs, but I wonder
about the advantages of using such huge batches.  Assuming a few
hundred bytes per document, 6100 docs produces a POST over 1MB in
size.

-Mike

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:
 Mike,
  I've been posting with the content type set like this:
   conn.setRequestProperty("Content-Type", "application/octet-stream");

  I tried your suggestion though, and unfortunately there was no change.
   conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

 -Sangraal


 On 7/27/06, Mike Klaas [EMAIL PROTECTED] wrote:
 
  On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
   class SolrConnection:
     def __init__(self, host='localhost:8983', solrBase='/solr'):
       self.host = host
       self.solrBase = solrBase
       #a connection to the server is not opened at this point.
       self.conn = httplib.HTTPConnection(self.host)
       #self.conn.set_debuglevel(100)
       self.postheaders = {"Connection": "close"}

     def doUpdateXML(self, request):
       try:
         self.conn.request('POST', self.solrBase+'/update', request,
                           self.postheaders)
 
  Digressive note: I'm not sure if it is necessary with Tomcat, but in
  my experience driving Solr from Python using Jetty, it was necessary
  to specify the content-type when posting utf-8 data:
 
  self.postheaders.update({'Content-Type': 'text/xml; charset=utf-8'})
 
  -Mike
 







Re: Doc add limit

2006-07-27 Thread sangraal aiken

I'm running on Tomcat... and I've verified that the complete post is making
it through the SolrUpdate servlet and into the SolrCore object... thanks for
the info though.
--
So the code is hanging on this call in SolrCore.java

   writer.write("<result status=\"" + status + "\"></result>");

The thread dump:

http-8080-Processor24 Id=32 in RUNNABLE (running in native) total cpu
time=40698.0440ms user time=38646.1680ms
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(
InternalOutputBuffer.java:746)
at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:433)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:348)
at
org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite
(InternalOutputBuffer.java:769)
at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(
ChunkedOutputFilter.java:125)
at org.apache.coyote.http11.InternalOutputBuffer.doWrite(
InternalOutputBuffer.java:579)
at org.apache.coyote.Response.doWrite(Response.java:559)
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(
OutputBuffer.java:361)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:324)
at org.apache.tomcat.util.buf.IntermediateOutputStream.write(
C2BConverter.java:235)
at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java
:336)
at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(
StreamEncoder.java:404)
at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
at org.apache.tomcat.util.buf.WriteConvertor.flush(C2BConverter.java
:184)
at org.apache.tomcat.util.buf.C2BConverter.flushBuffer(
C2BConverter.java:127)
at org.apache.catalina.connector.OutputBuffer.realWriteChars(
OutputBuffer.java:536)
at org.apache.tomcat.util.buf.CharChunk.flushBuffer(CharChunk.java:439)
at org.apache.tomcat.util.buf.CharChunk.append(CharChunk.java:370)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java
:491)
at org.apache.catalina.connector.CoyoteWriter.write(CoyoteWriter.java
:161)
at org.apache.catalina.connector.CoyoteWriter.write(CoyoteWriter.java
:170)
at org.apache.solr.core.SolrCore.update(SolrCore.java:695)
at org.apache.solr.servlet.SolrUpdateServlet.doPost(
SolrUpdateServlet.java:52)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
ApplicationFilterChain.java:252)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(
ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(
StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(
StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(
StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(
ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(
StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(
CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11Processor.process(
Http11Processor.java:869)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection
(Http11BaseProtocol.java:664)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(
PoolTcpEndpoint.java:527)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(
LeaderFollowerWorkerThread.java:80)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(
ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:613)

On 7/27/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:


I haven't been following the thread, but
Not sure if you are using Tomcat or Jetty, but Jetty has a POST size limit
(set somewhere in its configs) that may be the source of the problem.

Otis
P.S.
Just occurred to me.
Tomcat.  Jetty.  Tom & Jerry.  The Jetty guys should have called their thing
Jerry or Jerrymouse.

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, July 27, 2006 6:33:16 PM
Subject: Re: Doc add limit

Hi Sangraal:

Sorry--I tried not to imply that this might affect your issue.  You
may have to crank up the solr logging to determine where it is
freezing (and what might be happening).

It is certainly worth investigating why this occurs, but I wonder
about the advantages of using such huge batches.  Assuming a few
hundred bytes per document, 6100 docs produces a POST over 1MB in
size.

-Mike

On 

Re: Doc add limit

2006-07-27 Thread Yonik Seeley

You might also try the Java update client here:
http://issues.apache.org/jira/browse/SOLR-20

-Yonik


Re: Doc add limit

2006-07-27 Thread Yonik Seeley

On 7/27/06, sangraal aiken [EMAIL PROTECTED] wrote:

Commenting out the following line in SolrCore fixes my problem... but of
course I don't get the result status info... but this isn't a problem for me
really.

-Sangraal

writer.write("<result status=\"" + status + "\"></result>");


While it's possible you hit a Tomcat bug, I think it's more likely a
client problem.

-Yonik


Re: Doc add limit

2006-07-27 Thread sangraal aiken

I'll give that a shot...

Thanks again for all your help.

-S

On 7/27/06, Yonik Seeley [EMAIL PROTECTED] wrote:


You might also try the Java update client here:
http://issues.apache.org/jira/browse/SOLR-20

-Yonik