Re: HTMLStripReader and script tags

2008-04-10 Thread Yonik Seeley
It was the intention to remove script tags.
I developed HTMLStripReader by just looking at a bunch of real-world HTML.
I hadn't run across script in uppercase, so I didn't do a
case-insensitive check.

The code is currently:
if (name.equals("script") || name.equals("style")) {

Should be easy enough to change unless there is a good reason not to.
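
A case-insensitive version of that check might be as simple as (just a sketch):

if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {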

-Yonik

On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> I've noticed that passing HTML to a field using
> HTMLStripWhitespaceTokenizerFactory ends up including some JavaScript too.
>  For example, using an analyzer like:
>  
>  
>
>  
>  
>
>  with a text such as:
>  
>  title
>  
>  pre
>  
>  var time = new Date();
>  ordval= (time.getTime());
>  
>  post 
>  
>  
>
>  Analysis.jsp produces these tokens:
>  title
>  pre
>  var
>  time
>  =
>  new
>  Date();
>  ordval=
>  (time.getTime());
>  post
>
>  If the script in the page is commented out, however, everything works fine.
>  Is this a design choice? Shouldn't scripts be removed in both cases?
>  (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> 2008-03-24 09:59:40)
>
>  Walter
>
>


Re: HTMLStripReader and script tags

2008-04-10 Thread Yonik Seeley
I've just committed a change to ignore case when comparing tag names.
-Yonik

On Thu, Apr 10, 2008 at 9:03 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> It was the intention to remove script.
>  I developed HTMLStripReader by just looking at a bunch of real-world HTML.
>  I hadn't run across script in uppercase, so I didn't do a case
>  insensitive check.
>
>  The code is currently:
> if (name.equals("script") || name.equals("style")) {
>
>  Should be easy enough to change unless there is a good reason not to.
>
>  -Yonik
>
>
>
>  On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
>  > I've noticed that passing html to a field using
>  > HTMLStripWhitespaceTokenizerFactory, ends up in having some javascripts 
> too.
>  >  For example, using a analyzer like:
>  >  
>  >  
>  >
>  >  
>  >  
>  >
>  >  with a text such as:
>  >  
>  >  title
>  >  
>  >  pre
>  >  
>  >  var time = new Date();
>  >  ordval= (time.getTime());
>  >  
>  >  post 
>  >  
>  >  
>  >
>  >  Analysis.jsp turns out those tokens:
>  >  title
>  >  pre
>  >  var
>  >  time
>  >  =
>  >  new
>  >  Date();
>  >  ordval=
>  >  (time.getTime());
>  >  post
>  >
>  >  While if the script in the page is commented, everything works fine.
>  >  Is this due to design choice? Shouldn't scripts be removed in both cases?
>  >  (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
>  > 2008-03-24 09:59:40)
>  >
>  >  Walter
>  >
>  >
>


Re: Facet Query

2008-04-11 Thread Yonik Seeley
On Fri, Apr 11, 2008 at 4:32 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> What do facet queries do that is different from the regular query?  What is
>  a use case where I would use a facet.query in addition to the regular query?

It returns the number of documents that match the query AND the facet.query.
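
For example (hypothetical field and ranges), a request like

http://localhost:8983/solr/select?q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO 200]

would return the normal results for q=ipod plus, for each facet.query, the
count of those results that also fall in that price range.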

-Yonik


Re: ODD Solr Error on Update POST - XMLStreamException: ParseError

2008-04-17 Thread Yonik Seeley
Since you've already tried different Solr versions and different JVM
versions, it's most likely Tomcat... try version 5.5.  If that doesn't
work, try a different OS (less likely, but it could be a libc bug or
something).

-Yonik

On Thu, Apr 17, 2008 at 3:28 PM, realw5 <[EMAIL PROTECTED]> wrote:
>
>  Hey All,
>
>  I've been beating my head on this problem with no luck on finding the cause.
>  I've done many nabble and google searches with no real solution. Let me
>  explain the problem. First my system setup:
>
>  Ubuntu 64bit Linux 6.06
>  Java 1.6 (amd64) (Also tried 1.5 with same results)
>  Solr 1.2 (also tried 1.3-dev with same results)
>  Tomcat 6.0.16 (10 gigs of memory allocated) (also trying 5.5 with same
>  results)
>
>  Now the problem. When posting larger size documents, I am getting very
>  random errors. I can hit refresh on the page, and the error will change on
>  me. Here are some of those errors. The first error is the most common one,
>  although the second one does come up too.
>
>  javax.xml.stream.XMLStreamException: ParseError at
>  [row,col]:[1,1] Message: Content is not allowed in prolog. at
>  
> com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
>  at
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:148)
>  at
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
>  at
>  org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:65)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:710) at
>  javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at
>  
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>  at
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>  at
>  
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>  at
>  
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>  at
>  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>  at
>  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>  at
>  
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>  at
>  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>  at
>  org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>  at
>  
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>  at java.lang.Thread.run(Thread.java:619) 
>
>  AND
>
>  javax.xml.stream.XMLStreamException: ParseError at
>  [row,col]:[2,8191] Message: XML document structures must start and end
>  within the same entity. at
>  
> com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
>  at
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
>  at
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>  at
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
>  at
>  org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:65)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:710) at
>  javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at
>  
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>  at
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>  at
>  
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>  at
>  
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>  at
>  
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>  at
>  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>  at
>  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>  at
>  
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>  at
>  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>  at
>  org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>

Re: ODD Solr Error on Update POST - XMLStreamException: ParseError

2008-04-17 Thread Yonik Seeley
On Thu, Apr 17, 2008 at 5:41 PM, realw5 <[EMAIL PROTECTED]> wrote:
>  Ok, so I tried Tomcat 5.5; still no go. It might be helpful to note that
>  when I decrease the size of the post (take fields out) I can get it to post
>  without error. It seems like it's barfing at a certain file size (buffer
>  issue maybe??). I'm running 32-bit Ubuntu on our production system and have
>  never seen these errors. Is it possible libc has a bug only in 64-bit
>  Ubuntu?
>
>  Lastly, I can try another OS...do you have any recommendations for a good
>  64-bit linux flavor?

Whatever you are comfortable with... if you don't already have
something lying around, perhaps the latest Ubuntu beta (8.04).

Also double-check that you are sending exactly what you think you are.
If you haven't already, capture the XML you are sending to a file,
then use curl (the same version on the same box) to send the file to
both the server that works and the one that doesn't.
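
Something along these lines should work for posting the captured file (adjust
the host, port, and file name):

curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml' --data-binary @captured.xml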

-Yonik


Re: ODD Solr Error on Update POST - XMLStreamException: ParseError

2008-04-17 Thread Yonik Seeley
On Thu, Apr 17, 2008 at 8:12 PM, Brian Johnson <[EMAIL PROTECTED]> wrote:
> The XML parser is probably not thread-safe but is being reused concurrently by
> multiple post threads, resulting in these exceptions.

Hmmm, yes, the factory is reused... here's the code we use to try and
make it thread-safe:

  @Override
  public void init(NamedList args)
  {
    super.init(args);

    inputFactory = BaseXMLInputFactory.newInstance();
    try {
      // The java 1.6 bundled stax parser (sjsxp) does not currently have a thread-safe
      // XMLInputFactory, as that implementation tries to cache and reuse the
      // XMLStreamReader.  Setting the parser-specific "reuse-instance" property
      // to false prevents this.
      // All other known open-source stax parsers (and the bea ref impl)
      // have thread-safe factories.
      inputFactory.setProperty("reuse-instance", Boolean.FALSE);
    }
    catch( IllegalArgumentException ex ) {
      // Other implementations will likely throw this exception since
      // "reuse-instance" is implementation specific.
      log.fine( "Unable to set the 'reuse-instance' property for the input factory: " + inputFactory );
    }
  }
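
Each request then creates its own XMLStreamReader from that shared factory,
roughly like this (a simplified sketch; "reader" stands for the incoming
request body):

  XMLStreamReader parser = inputFactory.createXMLStreamReader(reader);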


Dan: are you sending updates with multiple threads?  If so, can you
just try a single one at a time?

-Yonik



> The observed 'randomness' of the errors would be due to the unpredictable 
> nature of the race condition between threads. The reason you don't see this 
> with smaller documents would be that the likelihood of contention on small 
> documents is reduced because the race is eliminated. This would also be 
> generally independent of JVM, OS, memory allocation, etc as it seems to be.
>
>  I would look into how these classes/methods are dealing with the parser 
> factory. (keeping a static copy maybe?)
>
>
>  
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:148)
>
> org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
>
> org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:65)
>
>  This seems to me to be the most likely culprit given what I've seen so far 
> on this thread. I hope it helps.
>
>  -- Brian
>
>
>
>  - Original Message 
>  From: Yonik Seeley <[EMAIL PROTECTED]>
>  To: solr-user@lucene.apache.org
>  Sent: Thursday, April 17, 2008 3:28:08 PM
>  Subject: Re: ODD Solr Error on Update POST - XMLStreamException: ParseError
>
>  On Thu, Apr 17, 2008 at 5:41 PM, realw5 <[EMAIL PROTECTED]> wrote:
>  >  Ok, so I tried tomcat 5.5, still not go. It might be helpful to note, that
>  >  when I decrease the size of the post (take fields out) I can get it to 
> post
>  >  without error. It seems like it's barfing on a certain file size (buffer
>  >  issue maybe??). I'm running 32-bit Ubuntu on our production system and 
> have
>  >  never seen these errors. Is it possible libc has a bug only in 64-bit
>  >  Ubuntu?
>  >
>  >  Lastly, I can try another OS...do you have any recommendations for a good
>  >  64-bit linux flavor?
>
>  Whatever you are comfortable with... if you don't already have
>  something lying around, perhaps the latest ubuntu beta (8.04)
>
>  Also double-check that you are sending exactly what you think you are.
>  If you haven't already, capture the XML you are sending to a file,
>  then use curl (the same version on the same box) to send the file to
>  both the server that works and the one that doesn't.
>
>  -Yonik
>
>
>
>


Re: POST interface to sending queries to SOLR?

2008-04-21 Thread Yonik Seeley
On Mon, Apr 21, 2008 at 4:13 PM, Jim Adams <[EMAIL PROTECTED]> wrote:
> Could you point me to an example somewhere?

The command line tool "curl" can do either GET or POST:

curl http://localhost:8983/solr/select --data 'q=foo&rows=100'
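
The GET form of the same request would be:

curl 'http://localhost:8983/solr/select?q=foo&rows=100'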

-Yonik


Re: Updating in Solr.SOLR-139

2008-04-24 Thread Yonik Seeley
Koji: perhaps you are working with the "update" patch?  I'm pretty
sure these things won't work with stock solr, right?

-Yonik

On Fri, Apr 18, 2008 at 10:30 AM, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> You don't need any additional attributes in schema.xml, but the field
>  should be stored.
>  You can overwrite the existing field value of the doc AAA w/
>  the following XML:
>
>  
>  
>   AAA
>   German
>  
>  
>
>  and post the following URL:
>
>  http://localhost:8389/solr/update?mode=tags:overwrite&commit=true
>
>  If the tags field is defined as multivalued="true", you can append new
> tags:
>
>  http://localhost:8389/solr/update?mode=tags:append&commit=true
>  
>  
>   AAA
>   Japanese
>   French
>  
>  
>
>  For number fields, you can use "increment" command.
>
>  Note that the mode parameter can be acceptable one or more name:command
> pairs:
>
>  mode=fieldName1:command1,fieldName2:command2,...
>
>  Thank you,
>
>  Koji
>
>
>
>
>  nutchvf wrote:
>
> > Hi!
> > There are any option to update a field (or a set of fields) of a document
> > indexed in Solr,without having to update all the fields of the entire
> > document???
> > I have seen the SOLR-139 patch,but  I do not know what is the proper
> syntax
> > of the command (or the xml to post) to update the document.Is required an
> > additional tag in the schema.xml describing the updatable property???
> >
> > For example:
> >
> >  > stored="true"/>
> > Please,I hope any suggestion!!!
> >
> > What is the xml required for the updating???For example,something like
> this:
> >
> > 
> > SOLR1000
> > 9
> > 
> >
> >
> > Regards..
> >
> >
>
>


Re: Updating in Solr.SOLR-139

2008-04-24 Thread Yonik Seeley
Apologies, I read too fast and didn't see that the original user was
in fact using the ModifiableDocument patch (that's what I was
referring to by "update patch").

On Thu, Apr 24, 2008 at 12:11 PM, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> Yonik,
>
>  I'm afraid but I don't understand what you mean by "update patch".
>  I did this in last week with Eriks-ModifiableDocument.patch in
>  SOLR-139 and got working...
>
>  Koji
>
>
>
>  Yonik Seeley wrote:
>
> > Koji: perhaps you are working with the "update" patch?  I'm pretty
> > sure these things won't work with stock solr, right?
> >
> > -Yonik
> >
> > On Fri, Apr 18, 2008 at 10:30 AM, Koji Sekiguchi <[EMAIL PROTECTED]>
> wrote:
> >
> >
> > > You don't need any additional attributes in schema.xml, but the field
> > >  should be stored.
> > >  You can overwrite the existing field value of the doc AAA w/
> > >  the following XML:
> > >
> > >  
> > >  
> > >  AAA
> > >  German
> > >  
> > >  
> > >
> > >  and post the following URL:
> > >
> > >  http://localhost:8389/solr/update?mode=tags:overwrite&commit=true
> > >
> > >  If the tags field is defined as multivalued="true", you can append new
> > > tags:
> > >
> > >  http://localhost:8389/solr/update?mode=tags:append&commit=true
> > >  
> > >  
> > >  AAA
> > >  Japanese
> > >  French
> > >  
> > >  
> > >
> > >  For number fields, you can use "increment" command.
> > >
> > >  Note that the mode parameter can be acceptable one or more name:command
> > > pairs:
> > >
> > >  mode=fieldName1:command1,fieldName2:command2,...
> > >
> > >  Thank you,
> > >
> > >  Koji
> > >
> > >
> > >
> > >
> > >  nutchvf wrote:
> > >
> > >
> > >
> > > > Hi!
> > > > There are any option to update a field (or a set of fields) of a
> document
> > > > indexed in Solr,without having to update all the fields of the entire
> > > > document???
> > > > I have seen the SOLR-139 patch,but  I do not know what is the proper
> > > >
> > > >
> > > syntax
> > >
> > >
> > > > of the command (or the xml to post) to update the document.Is required
> an
> > > > additional tag in the schema.xml describing the updatable property???
> > > >
> > > > For example:
> > > >
> > > >  > > > stored="true"/>
> > > > Please,I hope any suggestion!!!
> > > >
> > > > What is the xml required for the updating???For example,something like
> > > >
> > > >
> > > this:
> > >
> > >
> > > > 
> > > > SOLR1000
> > > > 9
> > > > 
> > > >
> > > >
> > > > Regards..
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >
>
>


Re: Distributed Search Caching

2008-04-24 Thread Yonik Seeley
On Thu, Apr 24, 2008 at 3:16 PM, swarag <[EMAIL PROTECTED]> wrote:
>
>  hey,
>  I have a distributed search environment with one server hitting 3 shards.
>  for Example:
>  
> http://server1.cs.tmcs:15100/solr/search/?q=starbucks&shards=server1.cs.tmcs:8983/solr,server2.cs.tmcs:8983/solr,server3.cs.tmcs:8983/solr&collapse.field=locChainId
>  So, where is the cache stored? Is it distributed on the 3 servers or is it
>  on server1.cs.tmcs:15100?

It's distributed.
You don't really need server1.cs.tmcs:15100 (assuming it's a real
server) since it doesn't maintain any state... you can set "shards" as
a default parameter in a custom request handler on all 3 shards, then
query any of them to distribute the load.
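
A sketch of what that default might look like in solrconfig.xml on each shard
(handler name and host names are just placeholders):

  <requestHandler name="/distrib" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">server1.cs.tmcs:8983/solr,server2.cs.tmcs:8983/solr,server3.cs.tmcs:8983/solr</str>
    </lst>
  </requestHandler>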

-Yonik


Re: Multiple open SegmentReaders?

2008-04-30 Thread Yonik Seeley
Hmmm, if there is a bug, odds are it's due to multicore stuff  -
probably nothing else has touched core stuff like that recently.
Can you reproduce (or rather help others to reproduce) with the
solr/example setup?

-Yonik

On Wed, Apr 30, 2008 at 5:39 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Hello!
>
>  In using the SVN head version of Solr, I've found that recently we started
> getting multiple open SegmentReaders, all registered... etc..
>
>  Any ideas why this would happen? They don't go away unless the server is
> restarted, and don't go away with commits, etc. In fact, commits seem to
> cause the issue. They're causing issues since it causes really stale
> searchers to be around...
>
>  For example, right now...
>  org.apache.solr.search.SolrIndexSearcher
>  caching : true
>  numDocs : 153312
>  maxDoc : 153324
>  readerImpl : SegmentReader
>  readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
>  indexVersion : 1205944085143
>  openedAt : Wed Apr 30 14:04:15 PDT 2008
>  registeredAt : Wed Apr 30 14:04:15 PDT 2008
>
>  (and right below that one...)
>  org.apache.solr.search.SolrIndexSearcher
>  caching : true
>  numDocs : 153312
>  maxDoc : 153324
>  readerImpl : SegmentReader
>  readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
>  indexVersion : 1205944085143
>  openedAt : Wed Apr 30 14:30:02 PDT 2008
>  registeredAt : Wed Apr 30 14:30:02 PDT 2008
>
>  Thanks!
>
>  Matthew Runo
>  Software Developer
>  Zappos.com
>  702.943.7833
>
>


Re: token concat filter?

2008-05-01 Thread Yonik Seeley
If there are only a few such cases, it might be better to use synonyms
to correct them.
Off the top of my head there's no concatenating token filter, but it
wouldn't be hard to make one.
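
For example, a synonyms.txt entry along these lines would let the two forms
match (a sketch; whether to expand at index or query time depends on your
analysis chain):

radiohead, radio head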

-Yonik

On Thu, May 1, 2008 at 8:44 AM, Geoffrey Young
<[EMAIL PROTECTED]> wrote:
> hi :)
>
>  I'm looking for a filter that will compress all tokens into a single token.
> the WordDelimiterFilterFactory does it for tokens it finds itself, but not
> ones passed to it.
>
>  basically, I'm trying to match
>
>   Radiohead
>
>  in the index with
>
>   radio head
>
>  in the query.  if it were spelled RadioHead or Radio-head in the index I'd
> find it, but as it is I'm missing it... unless I could squish all the query
> terms into a single token.  or maybe there's another route I haven't thought
> about yet.
>
>
>  --Geoff
>


Re: Sort results on a field not ordered

2008-05-02 Thread Yonik Seeley
On Fri, May 2, 2008 at 8:17 AM, Geoffrey Young
<[EMAIL PROTECTED]> wrote:
>  does this apply to facet fields as well?  I noticed that if I set
> facet.sort="true" the results are indeed sorted by count... until the counts
> are the same, after which they are in random order (instead of ascii alpha).

facet.sort=true should already be the default.
Ties in count are broken by the order in the term index (not randomly).
This should correspond to alphabetical (ASCII) order.

-Yonik


Re: Multiple open SegmentReaders?

2008-05-02 Thread Yonik Seeley
On Fri, May 2, 2008 at 1:08 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Hah, thank you for doing this. Sometimes I see MultiSegmentReaders,
> sometimes SegmentReaders, so both show up from time to time. Right now we've
> got two MultiSegmentReaders open..

OK, this implies there's a leak and the initial searcher that is
opened never gets closed.
Could you open a JIRA issue for this?

-Yonik


>
>  Thanks!
>
>  Matthew Runo
>  Software Developer
>  Zappos.com
>  702.943.7833
>
>
>  On May 1, 2008, at 7:19 PM, Koji Sekiguchi wrote:
>
> > I can reproduce with solr/example setup.
> > What I did:
> >
> > 1. $ svn co http://svn.apache.org/repos/asf/lucene/solr/trunk TEMP
> > 2. $ cd TEMP
> > 3. $ ant clean example
> > 4. $ cd example
> > 5. $ java -jar start.jar
> >
> > (to post commit)
> > 6. $ cd $SOLR_HOME/example/exampledocs
> > 7. $ ./post.sh
> >
> > then see admin>statistics. I can see MultiSegmentReader instead of
> > SegmentReader, though.
> >
> > name:  [EMAIL PROTECTED] main class:
> org.apache.solr.search.SolrIndexSearcher version: 1.0 description:
> index searcher stats: caching : true
> > numDocs : 0
> > maxDoc : 0
> > readerImpl : MultiSegmentReader
> > readerDir :
> [EMAIL PROTECTED]:\Project\jakarta\lucene\solr\TEMP\example\solr\data\index
> > indexVersion : 1209693930226
> > openedAt : Fri May 02 11:05:30 JST 2008
> > registeredAt : Fri May 02 11:05:30 JST 2008
> >  name: [EMAIL PROTECTED] main class:
> org.apache.solr.search.SolrIndexSearcher version: 1.0 description:
> index searcher stats: caching : true
> > numDocs : 0
> > maxDoc : 0
> > readerImpl : MultiSegmentReader
> > readerDir :
> [EMAIL PROTECTED]:\Project\jakarta\lucene\solr\TEMP\example\solr\data\index
> > indexVersion : 1209693930226
> > openedAt : Fri May 02 11:06:13 JST 2008
> > registeredAt : Fri May 02 11:06:13 JST 2008
> >
> > Koji
> >
> >
> > Yonik Seeley wrote:
> >
> > > Hmmm, if there is a bug, odds are it's due to multicore stuff  -
> > > probably nothing else has touched core stuff like that recently.
> > > Can you reproduce (or rather help others to reproduce) with the
> > > solr/example setup?
> > >
> > > -Yonik
> > >
> > > On Wed, Apr 30, 2008 at 5:39 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > > Hello!
> > > >
> > > > In using the SVN head version of Solr, I've found that recently we
> started
> > > > getting multiple open SegmentReaders, all registered... etc..
> > > >
> > > > Any ideas why this would happen? They don't go away unless the server
> is
> > > > restarted, and don't go away with commits, etc. In fact, commits seem
> to
> > > > cause the issue. They're causing issues since it causes really stale
> > > > searchers to be around...
> > > >
> > > > For example, right now...
> > > > org.apache.solr.search.SolrIndexSearcher
> > > > caching : true
> > > > numDocs : 153312
> > > > maxDoc : 153324
> > > > readerImpl : SegmentReader
> > > > readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> > > > indexVersion : 1205944085143
> > > > openedAt : Wed Apr 30 14:04:15 PDT 2008
> > > > registeredAt : Wed Apr 30 14:04:15 PDT 2008
> > > >
> > > > (and right below that one...)
> > > > org.apache.solr.search.SolrIndexSearcher
> > > > caching : true
> > > > numDocs : 153312
> > > > maxDoc : 153324
> > > > readerImpl : SegmentReader
> > > > readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> > > > indexVersion : 1205944085143
> > > > openedAt : Wed Apr 30 14:30:02 PDT 2008
> > > > registeredAt : Wed Apr 30 14:30:02 PDT 2008
> > > >
> > > > Thanks!
> > > >
> > > > Matthew Runo
> > > > Software Developer
> > > > Zappos.com
> > > > 702.943.7833
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
>
>


Re: Multiple open SegmentReaders?

2008-05-02 Thread Yonik Seeley
This bug was introduced in SOLR-509 (committed April 17th).
I'm working on a fix now.

-Yonik

On Fri, May 2, 2008 at 2:32 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Fri, May 2, 2008 at 1:08 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
>  > Hah, thank you for doing this. Sometimes I see MultiSegmentReaders,
>  > sometimes SegmentReaders, so both show up from time to time. Right now 
> we've
>  > got two MultiSegmentReaders open..
>
>  OK, this implies there's a leak and the initial searcher that is
>  opened never gets closed.
>  Could you open a JIRA issue for this?
>
>  -Yonik
>
>
>
>
>  >
>  >  Thanks!
>  >
>  >  Matthew Runo
>  >  Software Developer
>  >  Zappos.com
>  >  702.943.7833
>  >
>  >
>  >  On May 1, 2008, at 7:19 PM, Koji Sekiguchi wrote:
>  >
>  > > I can reproduce with solr/example setup.
>  > > What I did:
>  > >
>  > > 1. $ svn co http://svn.apache.org/repos/asf/lucene/solr/trunk TEMP
>  > > 2. $ cd TEMP
>  > > 3. $ ant clean example
>  > > 4. $ cd example
>  > > 5. $ java -jar start.jar
>  > >
>  > > (to post commit)
>  > > 6. $ cd $SOLR_HOME/example/exampledocs
>  > > 7. $ ./post.sh
>  > >
>  > > then see admin>statistics. I can see MultiSegmentReader instead of
>  > > SegmentReader, though.
>  > >
>  > > name:  [EMAIL PROTECTED] main class:
>  > org.apache.solr.search.SolrIndexSearcher version: 1.0 description:
>  > index searcher stats: caching : true
>  > > numDocs : 0
>  > > maxDoc : 0
>  > > readerImpl : MultiSegmentReader
>  > > readerDir :
>  > [EMAIL PROTECTED]:\Project\jakarta\lucene\solr\TEMP\example\solr\data\index
>  > > indexVersion : 1209693930226
>  > > openedAt : Fri May 02 11:05:30 JST 2008
>  > > registeredAt : Fri May 02 11:05:30 JST 2008
>  > >  name: [EMAIL PROTECTED] main class:
>  > org.apache.solr.search.SolrIndexSearcher version: 1.0 description:
>  > index searcher stats: caching : true
>  > > numDocs : 0
>  > > maxDoc : 0
>  > > readerImpl : MultiSegmentReader
>  > > readerDir :
>  > [EMAIL PROTECTED]:\Project\jakarta\lucene\solr\TEMP\example\solr\data\index
>  > > indexVersion : 1209693930226
>  > > openedAt : Fri May 02 11:06:13 JST 2008
>  > > registeredAt : Fri May 02 11:06:13 JST 2008
>  > >
>  > > Koji
>  > >
>  > >
>  > > Yonik Seeley wrote:
>  > >
>  > > > Hmmm, if there is a bug, odds are it's due to multicore stuff  -
>  > > > probably nothing else has touched core stuff like that recently.
>  > > > Can you reproduce (or rather help others to reproduce) with the
>  > > > solr/example setup?
>  > > >
>  > > > -Yonik
>  > > >
>  > > > On Wed, Apr 30, 2008 at 5:39 PM, Matthew Runo <[EMAIL PROTECTED]> 
> wrote:
>  > > >
>  > > >
>  > > > > Hello!
>  > > > >
>  > > > > In using the SVN head version of Solr, I've found that recently we
>  > started
>  > > > > getting multiple open SegmentReaders, all registered... etc..
>  > > > >
>  > > > > Any ideas why this would happen? They don't go away unless the server
>  > is
>  > > > > restarted, and don't go away with commits, etc. In fact, commits seem
>  > to
>  > > > > cause the issue. They're causing issues since it causes really stale
>  > > > > searchers to be around...
>  > > > >
>  > > > > For example, right now...
>  > > > > org.apache.solr.search.SolrIndexSearcher
>  > > > > caching : true
>  > > > > numDocs : 153312
>  > > > > maxDoc : 153324
>  > > > > readerImpl : SegmentReader
>  > > > > readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
>  > > > > indexVersion : 1205944085143
>  > > > > openedAt : Wed Apr 30 14:04:15 PDT 2008
>  > > > > registeredAt : Wed Apr 30 14:04:15 PDT 2008
>  > > > >
>  > > > > (and right below that one...)
>  > > > > org.apache.solr.search.SolrIndexSearcher
>  > > > > caching : true
>  > > > > numDocs : 153312
>  > > > > maxDoc : 153324
>  > > > > readerImpl : SegmentReader
>  > > > > readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
>  > > > > indexVersion : 1205944085143
>  > > > > openedAt : Wed Apr 30 14:30:02 PDT 2008
>  > > > > registeredAt : Wed Apr 30 14:30:02 PDT 2008
>  > > > >
>  > > > > Thanks!
>  > > > >
>  > > > > Matthew Runo
>  > > > > Software Developer
>  > > > > Zappos.com
>  > > > > 702.943.7833
>  > > > >
>  > > > >
>  > > > >
>  > > > >
>  > > >
>  > > >
>  > > >
>  > >
>  > >
>  >
>  >
>


Re: Distributed Search (shard) w/ Multicore?

2008-05-02 Thread Yonik Seeley
On Fri, May 2, 2008 at 3:36 PM, Jon Baer <[EMAIL PROTECTED]> wrote:
>  Im trying to figure out if I can do this or if something else needs to be
> set, trying to run a query over multiple cores w/ the shard param?  I seem
> to be getting the correct number of results back but no data ... any ideas?

Should work OK (note that schemas should match across cores...
distributed search is not federated search).
You might need to be a little more explicit about what you are sending
and what you are getting back (the actual URL of the request, and the
actual XML of the response).

-Yonik


Re: Distributed Search (shard) w/ Multicore?

2008-05-02 Thread Yonik Seeley
Try adding echoParams=all to the request.
Maybe there is a default rows=0 or something.

Are you using a recent version of Solr?

-Yonik

On Fri, May 2, 2008 at 4:30 PM, Jon Baer <[EMAIL PROTECTED]> wrote:
> Sorry about that, Im sending something simple like:
>
>http://search.company.com:8115/solr/search/players?q=Smith&shards=box1:8115/search/players,box2:8115/search/players
>
>  Im getting back:
>
>  
>   
> 0
> 18
>   
>   
>  
>
>  Identical schemas, it found the correct 13 but no docs attached.  In the
> logs I can see the results come back (w/ wt=javabin&isShared=true) ...
>
>  - Jon
>
>
>
>  On May 2, 2008, at 3:41 PM, Yonik Seeley wrote:
>
>
> > On Fri, May 2, 2008 at 3:36 PM, Jon Baer <[EMAIL PROTECTED]> wrote:
> >
> > > Im trying to figure out if I can do this or if something else needs to
> be
> > > set, trying to run a query over multiple cores w/ the shard param?  I
> seem
> > > to be getting the correct number of results back but no data ... any
> ideas?
> > >
> >
> > Should work OK (note that schemas should match across cores...
> > distributed search is not federated search).
> > You might need to be a little more explicit about what you are sending
> > and what you are getting back (the actual URL of the request, and the
> > actual XML of the response).
> >
> > -Yonik


Re: single character terms in index - why?

2008-05-12 Thread Yonik Seeley
On Mon, May 12, 2008 at 4:13 PM, Naomi Dushay <[EMAIL PROTECTED]> wrote:
>  So I'm now asking:  why would SOLR want single character terms?

Solr, like Lucene, can be configured however you want.  The example
schema is just that - an example.

But, there are many field types that might be interested in keeping
single letter terms.
One can even think of examples where single-letter terms would be
useful for normal full-text fields, depending on the domain or on the
analysis.

One simple example:  "d-day" might be alternately indexed as "d" "day"
so it would be found with a query of "d day"
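
For instance, a WordDelimiterFilterFactory in the analysis chain can produce
that kind of split (a sketch):

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>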

-Yonik


Re: AND vs. OR query performance

2008-05-12 Thread Yonik Seeley
In general, AND will perform better than OR (because of skipping in
the scorers).  But if the number of documents matching the AND is
close to that matching the OR query, then skipping doesn't gain you
much and probably has a little more overhead.

-Yonik

On Sun, May 11, 2008 at 4:04 AM, Lars Kotthoff <[EMAIL PROTECTED]> wrote:
> Dear list,
>
>   during some performance experiments I have found that queries with ORed 
> search
>  terms are significantly faster than queries with ANDed search terms, 
> everything
>  else being equal.
>
>  Does anybody know whether this is the generally expected behaviour?
>
>  Thanks,
>
>  Lars
>


Re: Field Grouping

2008-05-12 Thread Yonik Seeley
On Mon, May 12, 2008 at 9:58 PM, oleg_gnatovskiy
<[EMAIL PROTECTED]> wrote:
>  Hello. I was wondering if there is a way to get Solr to return documents with
>  the same value for a particular field together. For example I might want to
>  have all the documents with exactly the same name field returned next to
>  each other. Is this possible? Thanks!

Sort by that field.  Since you can only sort by fields with a single
term at most (this rules out full-text fields), you might want to do a
copyField of the "name" field to something like a "name_s" field which
is of type string (which can be sorted on).
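
A schema sketch of that approach (field names are just examples):

  <field name="name_s" type="string" indexed="true" stored="false"/>
  <copyField source="name" dest="name_s"/>

Then sort with something like &sort=name_s asc on the query.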

-Yonik


Re: Commit problems on Solr 1.2 with Tomcat

2008-05-13 Thread Yonik Seeley
By default, a commit won't return until a new searcher has been opened
and the results are visible.
So just make sure you wait for the commit command to return before querying.

Also, if you are committing every add, you can avoid a separate commit
command by putting ?commit=true in the URL of the add command.
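
For example (a sketch, assuming the standard XML update handler):

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' --data-binary @doc.xml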

-Yonik

On Tue, May 13, 2008 at 9:31 AM, Alexander Ramos Jardim
<[EMAIL PROTECTED]> wrote:
> Maybe a delay in commit? How much time elapsed between commits?
>
>  2008/5/13 William Pierce <[EMAIL PROTECTED]>:
>
>
>
>  > Hi,
>  >
>  > I am having problems with Solr 1.2 running tomcat version 6.0.16 (I also
>  > tried 6.0.14 but same problems exist).  Here is the situation:  I have an
>  > ASP.net application where I am trying to  and  a single
>  > document to an index.   After I add the document and issue the  I
>  > can see (in the solr stats page) that the commit count has been increment
>  > but the docsPending is 1,  and my document is still not visible from a
>  > search perspective.
>  >
>  > When I issue another ,  the commit counter increments,
>  >  docsPending is now zero,  and my document is visible and searchable.
>  >
>  > I saw that someone was observing problems with 6.0.16 tomcat,  so I
>  > reverted back to 6.0.14.  Same problem.
>  >
>  > Can anyone help?
>  >
>  > -- Bill
>
>
>
>
>  --
>  Alexander Ramos Jardim
>


Re: ERROR:unknown field, but what document was it?

2008-05-13 Thread Yonik Seeley
On Thu, May 8, 2008 at 4:59 PM,  <[EMAIL PROTECTED]> wrote:
>  My tests showed that it was a big difference. It took about 1.2 seconds to
> index 500 separate adds in separate xml files (with a single commit
> afterwards), compared to about 200 milliseconds when sending a single xml
> with 500 adds.

Did you overlap the adds (use multiple threads)?

-Yonik


Re: Commit problems on Solr 1.2 with Tomcat

2008-05-13 Thread Yonik Seeley
Is SendSolrIndexingRequest synchronous or asynchronous?
If the call to SendSolrIndexingRequest() can return before the
response from the add is received, then the commit could sneak in and
finish *before* the add is done (in which case, you won't see it
before the next commit).

-Yonik

On Tue, May 13, 2008 at 10:49 AM, William Pierce <[EMAIL PROTECTED]> wrote:
> Erik:  I am indeed issuing multiple Solr requests.
>
>  Here is my code snippet (deletexml and addxml are the strings that contain
> the <delete> and <add> XML for the items to be added or deleted).   For
> our simple example,  nothing is being deleted so "stufftodelete" is always
> false.
>
> //we are done...we now need to post the requests...
>    if (stufftodelete)
>    {
>        SendSolrIndexingRequest(deletexml);
>    }
>    if (stufftoadd)
>    {
>        SendSolrIndexingRequest(addxml);
>    }
>
>    if (stufftodelete || stufftoadd)
>    {
>        SendSolrIndexingRequest("<commit waitSearcher=\"true\"/>");
>    }
>
>  I am using the full form of the commit here just to see if the <commit/>
> was somehow not working.
>
>  The SendSolrIndexingRequest is the routine that takes the string argument
> and issues the POST request to the update URL.
>
>  Thanks,
>
>  Bill
>
>  --
>  From: "Erik Hatcher" <[EMAIL PROTECTED]>
>  Sent: Tuesday, May 13, 2008 7:40 AM
>
>
>  To: 
>  Subject: Re: Commit problems on Solr 1.2 with Tomcat
>
>
> > I'm not sure if you are issuing a separate  _request_ after  your
> , or putting a  into the same request.  Solr only  supports
> one command (add or commit, but not both) per request.
> >
> > Erik
> >
> >
> > On May 13, 2008, at 10:36 AM, William Pierce wrote:
> >
> >
> > > Thanks for the comments
> > >
> > > The reason I am just adding one document followed by a commit is  for
> this particular test --- in actuality,  I will be loading  documents from a
> db. But thanks for the pointer on the ?commit=true  on the add command.
> > >
> > > Now on the  problem itself,  I am still confused:   Doesn't
> the commit count of 1 indicate that the commit is completed?
> > >
> > > In any event,  just for testing purposes,  I started everything  from
> scratch (deleted all documents, stopped/restarted tomcat).  I  noticed that
> the only files in my index folder were:  segments.gen  and segments_1.
> > >
> > > Then I did the add followed by  and noticed that there  were
> now three files:  segments.gen, segments_1 and write.lock.
> > >
> > > Now it is 7 minutes later, and when I query the index using the
> "http://localhost:59575/splus1/admin/"; url, I still do not see the document.
> > >
> > > Again, when I issue another  command everything seems to
> work. Why are TWO commit commands apparently required?
> > >
> > > Thanks,
> > >
> > > Sridhar
> > >
> > > --
> > > From: "Yonik Seeley" <[EMAIL PROTECTED]>
> > > Sent: Tuesday, May 13, 2008 6:42 AM
> > > To: 
> > > Subject: Re: Commit problems on Solr 1.2 with Tomcat
> > >
> > >
> > > > By default, a commit won't return until a new searcher has been
> opened
> > > > and the results are visible.
> > > > So just make sure you wait for the commit command to return before
> querying.
> > > >
> > > > Also, if you are committing every add, you can avoid a separate
> commit
> > > > command by putting ?commit=true in the URL of the add command.
> > > >
> > > > -Yonik
> > > >
> > > > On Tue, May 13, 2008 at 9:31 AM, Alexander Ramos Jardim
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Maybe a delay in commit? How may time elapsed between commits?
> > > > >
> > > > >  2008/5/13 William Pierce <[EMAIL PROTECTED]>:
> > > > >
> > > > >
> > > > >
> > > > >  > Hi,
> > > > >  >
> > > > >  > I am having problems with Solr 1.2 running tomcat version  6.0.16
> (I also
> > > > >  > tried 6.0.14 but same problems exist).  Here is the  situation:
> I have an
> > > > >  > ASP.net application where I am trying to  and  a
> single
> > > > >  > document to an index.   After I add the document and issue the
>  I
> > > > >  > can see (in the solr stats page) that the commit count has  been
> increment
> > > > >  > but the docsPending is 1,  and my document is still not  visible
> from a
> > > > >  > search perspective.
> > > > >  >
> > > > >  > When I issue another ,  the commit counter increments,
> > > > >  >  docsPending is now zero,  and my document is visible and
> searchable.
> > > > >  >
> > > > >  > I saw that someone was observing problems with 6.0.16 tomcat,
> so I
> > > > >  > reverted back to 6.0.14.  Same problem.
> > > > >  >
> > > > >  > Can anyone help?
> > > > >  >
> > > > >  > -- Bill
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  --
> > > > >  Alexander Ramos Jardim
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
>


Re: Commit problems on Solr 1.2 with Tomcat

2008-05-16 Thread Yonik Seeley
Don't rely on looking at the files in the index directory to tell if
an optimize has been performed.

http://www.nabble.com/what%27s-up-with%3A-java--Ddata%3Dargs--jar-post.jar-%22%3Coptimize-%3E%22-to16162870.html#a16179673

-Yonik

On Fri, May 16, 2008 at 12:00 AM, Eason. Lee <[EMAIL PROTECTED]> wrote:
> A similar problem I met before was with the <optimize/> operation.
> The first time I sent <optimize/> to Solr, the optimize operation did run,
> but the files were not merged. When I sent another <optimize/> to Solr, all
> the files were merged.
> This seems to happen only on Windows.
>
>
> 2008/5/13, Yonik Seeley <[EMAIL PROTECTED]>:
>>
>> Is SendSolrIndexingRequest synchronous or asynchronous?
>> If the call to SendSolrIndexingRequest() can return before the
>> response from the add is received, then the commit could sneak in and
>> finish *before* the add is done (in which case, you won't see it
>> before the next commit).
>>
>> -Yonik
>>
>> On Tue, May 13, 2008 at 10:49 AM, William Pierce <[EMAIL PROTECTED]>
>> wrote:
>> > Erik:  I am indeed issuing multiple Solr requests.
>> >
>> >  Here is my code snippet (deletexml and addxml are the strings that
>> contain
>> > the  and  strings for the items to be added or deleted).
>> For
>> > our simple example,  nothing is being deleted so "stufftodelete" is
>> always
>> > false.
>> >
>> > //we are done...we now need to post the requests...
>> >if (stufftodelete)
>> >{
>> >SendSolrIndexingRequest(deletexml);
>> >}
>> >if (stufftoadd)
>> >{
>> >SendSolrIndexingRequest(addxml);
>> >}
>> >
>> >if ( stufftodelete || stufftoadd)
>> >{
>> >SendSolrIndexingRequest("> > waitSearcher=\"true\"/>");
>> >}
>> >
>> >  I am using the full form of the commit here just to see if the > />
>> > was somehow not working.
>> >
>> >  The SendSolrIndexingRequest is the routine that takes the string
>> argument
>> > and issues the POST request to the update URL.
>> >
>> >  Thanks,
>> >
>> >  Bill
>> >
>> >  --
>> >  From: "Erik Hatcher" <[EMAIL PROTECTED]>
>> >  Sent: Tuesday, May 13, 2008 7:40 AM
>> >
>> >
>> >  To: 
>> >  Subject: Re: Commit problems on Solr 1.2 with Tomcat
>> >
>> >
>> > > I'm not sure if you are issuing a separate  _request_
>> after  your
>> > , or putting a  into the same request.  Solr only  supports
>> > one command (add or commit, but not both) per request.
>> > >
>> > > Erik
>> > >
>> > >
>> > > On May 13, 2008, at 10:36 AM, William Pierce wrote:
>> > >
>> > >
>> > > > Thanks for the comments
>> > > >
>> > > > The reason I am just adding one document followed by a commit is  for
>> > this particular test --- in actuality,  I will be loading  documents from
>> a
>> > db. But thanks for the pointer on the ?commit=true  on the add command.
>> > > >
>> > > > Now on the  problem itself,  I am still confused:   Doesn't
>> > the commit count of 1 indicate that the commit is completed?
>> > > >
>> > > > In any event,  just for testing purposes,  I started everything  from
>> > scratch (deleted all documents, stopped/restarted tomcat).  I  noticed
>> that
>> > the only files in my index folder were:  segments.gen  and segments_1.
>> > > >
>> > > > Then I did the add followed by  and noticed that
>> there  were
>> > now three files:  segments.gen, segments_1 and write.lock.
>> > > >
>> > > > Now it is 7 minutes later, and when I query the index using the
>> > "http://localhost:59575/splus1/admin/"; url, I still do not see the
>> document.
>> > > >
>> > > > Again, when I issue another  command everything seems to
>> > work. Why are TWO commit commands apparently required?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Sridhar
>> > > >
>> > > > --
>> 

Re: solr feed problem

2008-05-18 Thread Yonik Seeley
\ufffd isn't really a valid character.
http://www.fileformat.info/info/unicode/char/fffd/index.html
Your XML document or data probably had some kind of encoding issue
along the way somewhere.
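
One thing worth checking is that the encoding is declared explicitly when you
post, e.g. something like (a sketch):

curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml; charset=utf-8' --data-binary @docs.xml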

-Yonik

On Sun, May 18, 2008 at 7:59 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> hello,
>
> I am trying to feed solr with xml files of my own schema, and I am getting:
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can
> not start with character '\ufffd'
>
> my xml is utf8 for sure, as well as the text inside. but for some reason I
> get this exception and then solr crashes.
>
> Any ideas?
>
> Best Regards,
> -C.B.
>


Re: Iterating the entire dataset

2008-05-28 Thread Yonik Seeley
On Wed, May 28, 2008 at 6:26 PM, Daniel Garcia <[EMAIL PROTECTED]> wrote:
> Is there a simple way to query the entire dataset? I want to be able to 
> iterate through every document in the index.

q=*:*&start=0&rows=100
q=*:*&start=100&rows=100
etc

You could specify a very large rows, but it's probably best to handle
in pages/chunks.

-Yonik


Re: Solr indexing configuration help

2008-05-28 Thread Yonik Seeley
Not sure why you would be getting an OOM from just indexing, and with
the 1.5G heap you've given the JVM.
Have you tried Sun's JVM?

-Yonik

On Wed, May 28, 2008 at 7:35 PM, gaku113 <[EMAIL PROTECTED]> wrote:
>
> Hi all Solr users/developers/experts,
>
> I have the following scenario and I appreciate any advice for tuning my solr
> master server.
>
> I have a field in my schema that would index (but not stored) about ~1
> ids for each document.  This field is expected to govern the size of the
> document.  Each id can contain up to 6 characters.  I figure that there are
> two alternatives for this field, one is the use a string multi-valued field,
> and the other would be to pass a white-space-delimited string to solr and
> have solr tokenize such string based on whitespace (the text_ws fieldType).
> The master server is expected to receive constant stream of updates.
>
> The expected/estimated document size can range from 50k to 100k for a single
> document.  (I know this is quite large). The number of documents is expected
> to be around 200,000 on each master server, and there can be multiple master
> servers (sharding).  I wish the master can handle more docs too if I can
> figure a way out.
>
> Currently, I'm performing some basic stress tests to simulate the indexing
> side on the master server.  This stress test would continuously add new
> documents at the rate of about 10 documents every 30 seconds.  Autocommit is
> being used (50 docs and 180 seconds constraints), but I have no idea if this
> is the preferred way.  The goal is to keep adding new documents until we can
> get at least 200,000 documents (or about 20GB of index) on the master (or
> even more if the server can handle it)
>
> What I experienced from the indexing stress test is that the master server
> failed to respond after a while, such as non-pingable when there are about
> 30k documents.  When looking at the log, they are mostly:
> java.lang.OutOfMemoryError: Java heap space
> OR
> Ping query caused exception: null (this is probably caused by the OOM
> problem)
>
> There were also a few cases that the java process even went away.
>
> Questions:
> 1)  Is it better to use the multi-valued string field or the text_ws field
> for this large field?
> 2)  Is it better to have more outstanding docs per commit or more frequent
> commit, in term of maximizing server resources?  What is the preferred way
> to commit documents assuming that solr master receives updates frequently?
> How many updated docs should there be before issuing a commit?
> 3)  How to avoid the OOM problem in my case? I'm already doing (-Xms1536M
> -Xmx1536M) on a 2-GB machine. Is that not enough?  I'm concerned that adding
> more Ram would just delay the OOM problem.  Any additional JVM option to
> consider?
> 4)  Any recommendation for the master server configuration, in a sense 
> that I
> can maximize the number of indexed docs?
> 5)  How can it disable caching on the master altogether as queries won't 
> hit
> the master?
> 6)  For an average doc size of 50k-100k, is that too large for solr, or 
> even
> solr is the right tool? If not, any alternative?  If we are able to reduce
> the size of docs, can we expect to index more documents?
>
> The followings are info related to software/hardware/configuration:
>
> Solr version (solr nightly build on 5/23/2008)
>Solr Specification Version: 1.2.2008.05.23.08.06.59
>Solr Implementation Version: nightly
>Lucene Specification Version: 2.3.2
>Lucene Implementation Version: 2.3.2 652650
>Jetty: 6.1.3
>
> Schema.xml (the section that I think are relevant to the master server.)
>
> omitNorms="true"/>
> positionIncrementGap="100">
>  
>
>  
>
>
>  />
>  multiValued="true" omitNorms="true"/>
> stored="false"
> omitNorms="true"/>
>
> id
>
> Solrconfig.xml
>  
>false
>10
>500
>50
>5000
>2
>1000
>1
>
> org.apache.lucene.index.LogByteSizeMergePolicy
> org.apache.lucene.index.ConcurrentMergeScheduler
>single
>  
>
>  
>false
>50
>10
>
>500
>5000
>2
>false
>  
>  
>
>
>  50
>  18
>
>
>  solr/bin/snapshooter
>  .
>  true
>
>  
>
>  
>50
>  class="solr.LRUCache"
>  size="0"
>  initialSize="0"
>  autowarmCount="0"/>
>  class="solr.LRUCache"
>  size="0"
>  initialSize="0"
>  autowarmCount="0"/>
>  class="solr.LRUCache"
>  size="0"
>  initialSize="0"
>  autowarmCount="0"/>
>true
>
>1
>1
>
>
>  
> user_id 0  name="rows">1 
>static newSearcher warming query from
> solrconfig.xml
>  
>
>
>  
> fast_warm 0  name="rows">10 
>static firstSearcher warming query from
> solrconfig.xml
>  
>
>false
>4
>  
>
> Replication:
>The snappuller is scheduled to run every 15 mins fo

Re: Solr indexing configuration help

2008-05-28 Thread Yonik Seeley
On Wed, May 28, 2008 at 10:30 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
> I used the admin GUI to get the java info.
> java.vm.specification.vendor = Sun Microsystems Inc.
Well, your original email listed IcedTea... but that is mostly Sun
code,  so maybe that's why the vendor is still listed as Sun.

I'd recommend downloading 1.6.0_03 from java.sun.com and trying that.

Later versions (1.6.0_04+) have a JVM bug that bites Lucene, so stick
with 1.6.0_03 for now.

-Yonik


> Any suggestion?  Thanks a lot for your help!!
>
> -Gaku
>
>
> Yonik Seeley wrote:
>>
>> Not sure why you would be getting an OOM from just indexing, and with
>> the 1.5G heap you've given the JVM.
>> Have you tried Sun's JVM?
>>
>> -Yonik
>>
>> On Wed, May 28, 2008 at 7:35 PM, gaku113 <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi all Solr users/developers/experts,
>>>
>>> I have the following scenario and I appreciate any advice for tuning my
>>> solr
>>> master server.
>>>
>>> I have a field in my schema that would index (but not stored) about
>>> ~1
>>> ids for each document.  This field is expected to govern the size of the
>>> document.  Each id can contain up to 6 characters.  I figure that there
>>> are
>>> two alternatives for this field, one is the use a string multi-valued
>>> field,
>>> and the other would be to pass a white-space-delimited string to solr and
>>> have solr tokenize such string based on whitespace (the text_ws
>>> fieldType).
>>> The master server is expected to receive constant stream of updates.
>>>
>>> The expected/estimated document size can range from 50k to 100k for a
>>> single
>>> document.  (I know this is quite large). The number of documents is
>>> expected
>>> to be around 200,000 on each master server, and there can be multiple
>>> master
>>> servers (sharding).  I wish the master can handle more docs too if I can
>>> figure a way out.
>>>
>>> Currently, I'm performing some basic stress tests to simulate the
>>> indexing
>>> side on the master server.  This stress test would continuously add new
>>> documents at the rate of about 10 documents every 30 seconds.  Autocommit
>>> is
>>> being used (50 docs and 180 seconds constraints), but I have no idea if
>>> this
>>> is the preferred way.  The goal is to keep adding new documents until we
>>> can
>>> get at least 200,000 documents (or about 20GB of index) on the master (or
>>> even more if the server can handle it)
>>>
>>> What I experienced from the indexing stress test is that the master
>>> server
>>> failed to respond after a while, such as non-pingable when there are
>>> about
>>> 30k documents.  When looking at the log, they are mostly:
>>> java.lang.OutOfMemoryError: Java heap space
>>> OR
>>> Ping query caused exception: null (this is probably caused by the OOM
>>> problem)
>>>
>>> There were also a few cases that the java process even went away.
>>>
>>> Questions:
>>> 1)  Is it better to use the multi-valued string field or the text_ws
>>> field
>>> for this large field?
>>> 2)  Is it better to have more outstanding docs per commit or more
>>> frequent
>>> commit, in term of maximizing server resources?  What is the preferred
>>> way
>>> to commit documents assuming that solr master receives updates
>>> frequently?
>>> How many updated docs should there be before issuing a commit?
>>> 3)  How to avoid the OOM problem in my case? I'm already doing
>>> (-Xms1536M
>>> -Xmx1536M) on a 2-GB machine. Is that not enough?  I'm concerned that
>>> adding
>>> more Ram would just delay the OOM problem.  Any additional JVM option to
>>> consider?
>>> 4)  Any recommendation for the master server configuration, in a
>>> sense that I
>>> can maximize the number of indexed docs?
>>> 5)  How can it disable caching on the master altogether as queries
>>> won't hit
>>> the master?
>>> 6)  For an average doc size of 50k-100k, is that too large for solr,
>>> or even
>>> solr is the right tool? If not, any alternative?  If we are able to
>>> reduce
>>> the size of docs, can we expect to index more documents?
>>>
>>> The followings are info related to software/hardware/configuration:
>>

Re: EnableLazyFieldLoading?

2008-05-28 Thread Yonik Seeley
On Wed, May 28, 2008 at 11:00 PM, Dallan Quass <[EMAIL PROTECTED]> wrote:
> If I'm loading say 80-90% of the fields 80-90% of the time, and I don't have
> any large compressed text fields, is it safe to say that I'm probably better
> off to turn off lazy field loading?

Yes, as long as the 10-20% aren't really big.
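
That's the enableLazyFieldLoading setting in solrconfig.xml, e.g.:

  <enableLazyFieldLoading>false</enableLazyFieldLoading>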

-Yonik


Re: Want to drill down facet search result

2008-05-29 Thread Yonik Seeley
On Thu, May 29, 2008 at 12:22 PM, Rusli Ruslakall
<[EMAIL PROTECTED]> wrote:
> searched forever before posting and of course I found it shortly after :)
>
> Can use facet.prefix, beautiful!

You can also constrain both results and facets to any arbitrary query
via fq=myquery

-Yonik


> On Thu, May 29, 2008 at 3:43 PM, Rusli Ruslakall
> <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I index something like this:
>>
>> 
>>Company A
>>123
>>456
>>789
>> 
>>
>> 
>>Company B
>>129
>>123
>>987
>> 
>>
>> So I ONLY want to display all category names starting with '12' and
>> how many companies are in each one.
>>
>> In this example it should output:
>>
>> name count
>> 123  (2)
>> 129  (1)
>>
>>
>> What I have now is:
>> http://localhost:8983/solr/select/?q=cat:12&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1
>>
>> But with this I get all the categories which I would rather not prefer:
>>
>> name count
>> 123  (2)
>> 456  (1) <-- Rather not get this information
>> 789  (1) <-- Rather not get this information
>> 129  (1)
>> 987  (1) <-- Rather not get this information
>>
>>
>> Is there some way of achieving this in Solr?
>>
>> Thanks alot!
>> Jon
>>
>


Re: Search query optimization

2008-05-29 Thread Yonik Seeley
On Thu, May 29, 2008 at 4:05 PM, Yongjun Rong <[EMAIL PROTECTED]> wrote:
>  I have a question about how the Lucene query parser works. For example, I
> have the query "A AND B AND C". Will Lucene extract all documents satisfying
> condition A into memory and then filter them with conditions B and C?

No, Lucene will try and optimize this the best it can.

It roughly goes like this:
docnum = find_match("A")
docnum = find_first_match_after(docnum, "B")
docnum = find_first_match_after(docnum, "C")
etc...
until the same docnum is returned for "A", "B", and "C".

See ConjunctionScorer for the gritty details.

-Yonik



> or only
> the documents satisfying "A AND B AND C" will be put into memory? Is
> there any articles discuss about how to build a optimization query to
> save memory and improve performance?
>  Thank you very much.
>  Yongjun Rong
>


Re: Relevancy Issue - How do I make it work?

2008-05-29 Thread Yonik Seeley
field norms of un-boosted fields are normally less than 1 (it's a
factor that weights larger fields less).
The index-time boost is also multiplied into this factor though.
Given that your first doc had a huge norm, it looks like the document
or field was boosted at index time?
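
(Roughly, with the default Similarity the stored norm is
lengthNorm * documentBoost * fieldBoost, where lengthNorm is about
1/sqrt(number of terms in the field), so an un-boosted field can't get
anywhere near a norm of 40960.)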

-Yonik

On Thu, May 29, 2008 at 9:22 PM, Tim Christensen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> This is my first post. I have been working with Lucene for about 4 weeks and
> Solr for just about 10 days. We are going to convert our site search over to
> Solr as soon as we figure out some of the nuances.
>
> As I was testing out the synonyms features to decide how we could best use
> it, I searched for iPod (I know it is an example, but we actually sell
> them). I was shocked when the search results were nothing close to an iPod.
>
> Looking closer, I could see that the description had an iPod word in it,
> just 1. With debug on, that fact is confirmed (this is the first result):
> 
> 152529.23 = (MATCH) fieldWeight(search_text:ipod in 6247), product of:
>  1.0 = tf(termFreq(search_text:ipod)=1)
>  3.7238584 = idf(docFreq=522)
>  40960.0 = fieldNorm(field=search_text, doc=6247)
> 
> Here is an explainOther, FOR an actual iPod SKU (in the same search):
> id:650085488
>  
>  
> 1.0473351 = (MATCH) fieldWeight(search_text:ipod in 6985), product of:
>  3.0 = tf(termFreq(search_text:ipod)=9)
>  3.7238584 = idf(docFreq=522)
>  0.09375 = fieldNorm(field=search_text, doc=6985)
> 
> If the term frequency is higher, the only difference is'fieldNorm' which I
> do not understand in the context of relevancy. Does this have to do with
> omitNorms in some way?
> In a related factor, I also tried the dismax query with the following line
> in it:
> search_text^0.5 brand^10.0 keywords^5.0 title^20.0
> sub_title^1.5 model^2.0 attribute^1.1
> As an experiment I boosted the title a bunch, since this is where the term
> iPod exists the most. It made no effect, in fact, it was not even working.
> The title was not being used at all, just the search_text, even though I
> have it indexed.
> Here is the relevant schema parts
>required="true" />
>   
>   
>stored="true" />
>   
>   
>   
>multiValued="true" />
>   
>stored="true" />
>   
>   
>   
>   
>stored="true" />
>   
>/>
>   
>/>
>   
>   
>   
>   
>   
>multiValued="true" termVectors="true"/>
>
> search_text
>
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> Thanks to all who are willing to take a look at this and help.
>
> 
> Tim Christensen
> Director Media & Technology
> Vann's Inc.
> 406-203-4656
>
> [EMAIL PROTECTED]
>
> http://www.vanns.com
>
>
>
>
>
>
>
>


Re: Relevancy Issue - How do I make it work?

2008-05-29 Thread Yonik Seeley
On Thu, May 29, 2008 at 9:44 PM, Tim Christensen <[EMAIL PROTECTED]> wrote:
> Yonik,
>
> Thank you for the response. You are correct, regular (non-accessory)
> products are boosted '2.0' at index time. However both items the non ipod
> item and the ipod would have received the initial boost on the same fields
> since they are both non-accessory items.
>
> Is your comment still relevant in that context?

Yes.
There's a bug somewhere that ended up boosting that document or field
much more than normal.

First thing is to determine if it's in your indexing code, or in Solr.
Is there a way for you to verify the exact data you sent to Solr for
that document (the exact XML, if that is what you are sending?)

-Yonik


> Tim
>
> On May 29, 2008, at 7:30 PM, Yonik Seeley wrote:
>
>> field norms of un-boosted fields are normally less than 1 (it's a
>> factor that weights larger fields less).
>> The index-time boost is also multiplied into this factor though.
>> Given that your first doc had a huge norm, it looks like the document
>> or field was boosted at index time?
>>
>> -Yonik
>>
>> On Thu, May 29, 2008 at 9:22 PM, Tim Christensen <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>>
>>> This is my first post. I have been working with Lucene for about 4 weeks
>>> and
>>> Solr for just about 10 days. We are going to convert our site search over
>>> to
>>> Solr as soon as we figure out some of the nuances.
>>>
>>> As I was testing out the synonyms features to decide how we could best
>>> use
>>> it, I searched for iPod (I know it is an example, but we actually sell
>>> them). I was shocked when the search results were nothing close to an
>>> iPod.
>>>
>>> Looking closer, I could see that the description had an iPod word in it,
>>> just 1. With debug on, that fact is confirmed (this is the first result):
>>> 
>>> 152529.23 = (MATCH) fieldWeight(search_text:ipod in 6247), product of:
>>> 1.0 = tf(termFreq(search_text:ipod)=1)
>>> 3.7238584 = idf(docFreq=522)
>>> 40960.0 = fieldNorm(field=search_text, doc=6247)
>>> 
>>> Here is an explainOther, FOR an actual iPod SKU (in the same search):
>>> id:650085488
>>> 
>>> 
>>> 1.0473351 = (MATCH) fieldWeight(search_text:ipod in 6985), product of:
>>> 3.0 = tf(termFreq(search_text:ipod)=9)
>>> 3.7238584 = idf(docFreq=522)
>>> 0.09375 = fieldNorm(field=search_text, doc=6985)
>>> 
>>> If the term frequency is higher, the only difference is'fieldNorm' which
>>> I
>>> do not understand in the context of relevancy. Does this have to do with
>>> omitNorms in some way?
>>> In a related factor, I also tried the dismax query with the following
>>> line
>>> in it:
>>> search_text^0.5 brand^10.0 keywords^5.0 title^20.0
>>> sub_title^1.5 model^2.0 attribute^1.1
>>> As an experiment I boosted the title a bunch, since this is where the
>>> term
>>> iPod exists the most. It made no effect, in fact, it was not even
>>> working.
>>> The title was not being used at all, just the search_text, even though I
>>> have it indexed.
>>> Here is the relevant schema parts
>>>  >> required="true" />
>>>  
>>>  
>>>  >> stored="true" />
>>>  
>>>  
>>>  
>>>  >> multiValued="true" />
>>>  
>>>  >> stored="true" />
>>>  
>>>  
>>>  
>>>  
>>>  >> stored="true" />
>>>  >> />
>>>  >> />
>>>  >> />
>>>  >> stored="true"
>>> />
>>>  >> />
>>>  
>>>  
>>>  
>>>  
>>>  >> multiValued="true" termVectors="true"/>
>>>
>>> search_text
>>>
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>> Thanks to all who are willing to take a look at this and help.
>>>
>>> 
>>> Tim Christensen
>>> Director Media & Technology
>>> Vann's Inc.
>>> 406-203-4656
>>>
>>> [EMAIL PROTECTED]
>>>
>>> http://www.vanns.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
>
> 
> Tim Christensen
> Director Media & Technology
> Vann's Inc.
> 406-203-4656
>
> [EMAIL PROTECTED]
>
> http://www.vanns.com
>
>
>
>
>
>
>
>


Re: Solr indexing configuration help

2008-05-29 Thread Yonik Seeley
It's most likely a
1) hardware issue: bad memory
 OR
2) incompatible libraries (most likely libc version for the JVM).

If you have another box around, try that.

-Yonik

On Thu, May 29, 2008 at 9:51 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
>
> Hi Yonik and others,
>
> I'm getting this java error after switching to JVM 1.6.0_3.  This error
> occurs after the stress test has been going for a while and failed at 12K
> docs level and at 18K again.  Am I doing something wrong?  Please help!
>
> Thanks!
>
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x2adfbf6d, pid=25030, tid=1079175504
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed mode)
> # Problematic frame:
> # V  [libjvm.so+0x230f6d]
> #
> # An error report file with more information is saved as hs_err_pid25030.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>
> -Gaku
>
>
> Yonik Seeley wrote:
>>
>> On Wed, May 28, 2008 at 10:30 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
>>> I used the admin GUI to get the java info.
>>> java.vm.specification.vendor = Sun Microsystems Inc.
>> Well, your original email listed IcedTea... but that is mostly Sun
>> code,  so maybe that's why the vendor is still listed as Sun.
>>
>> I'd recommend downloading1.6.0_3 from java.sun.com and trying that.
>>
>> Later versions (1.6.0_04+) have a JVM bug that bites Lucene, so stick
>> with 1.6.0_03 for now.
>>
>> -Yonik
>>
>>
>>> Any suggestion?  Thanks a lot for your help!!
>>>
>>> -Gaku
>>>
>>>
>>> Yonik Seeley wrote:
>>>>
>>>> Not sure why you would be getting an OOM from just indexing, and with
>>>> the 1.5G heap you've given the JVM.
>>>> Have you tried Sun's JVM?
>>>>
>>>> -Yonik
>>>>
>>>> On Wed, May 28, 2008 at 7:35 PM, gaku113 <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Hi all Solr users/developers/experts,
>>>>>
>>>>> I have the following scenario and I appreciate any advice for tuning my
>>>>> solr
>>>>> master server.
>>>>>
>>>>> I have a field in my schema that would index (but not stored) about
>>>>> ~1
>>>>> ids for each document.  This field is expected to govern the size of
>>>>> the
>>>>> document.  Each id can contain up to 6 characters.  I figure that there
>>>>> are
>>>>> two alternatives for this field, one is the use a string multi-valued
>>>>> field,
>>>>> and the other would be to pass a white-space-delimited string to solr
>>>>> and
>>>>> have solr tokenize such string based on whitespace (the text_ws
>>>>> fieldType).
>>>>> The master server is expected to receive constant stream of updates.
>>>>>
>>>>> The expected/estimated document size can range from 50k to 100k for a
>>>>> single
>>>>> document.  (I know this is quite large). The number of documents is
>>>>> expected
>>>>> to be around 200,000 on each master server, and there can be multiple
>>>>> master
>>>>> servers (sharding).  I wish the master can handle more docs too if I
>>>>> can
>>>>> figure a way out.
>>>>>
>>>>> Currently, I'm performing some basic stress tests to simulate the
>>>>> indexing
>>>>> side on the master server.  This stress test would continuously add new
>>>>> documents at the rate of about 10 documents every 30 seconds.
>>>>> Autocommit
>>>>> is
>>>>> being used (50 docs and 180 seconds constraints), but I have no idea if
>>>>> this
>>>>> is the preferred way.  The goal is to keep adding new documents until
>>>>> we
>>>>> can
>>>>> get at least 200,000 documents (or about 20GB of index) on the master
>>>>> (or
>>>>> even more if the server can handle it)
>>>>>
>>>>> What I experienced from the indexing stress test is that the master
>>>>> server
>>>>> failed to respond after a while, such as non-pingable when there are
>>>>> about
>>>>> 30k documents.  When looking at the log, they ar

Re: Solr indexing configuration help

2008-05-30 Thread Yonik Seeley
Some things to try:
- turn off autowarming on the master
- turn off autocommit, unless you really need it, or change it to be
less aggressive: autocommitting every 50 docs is bad if you are
rapidly adding documents.
- set maxWarmingSearchers to 1 to prevent the buildup of searchers
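In solrconfig.xml terms that might look roughly like this (a sketch; sizes are
illustrative, not exact values):

  <!-- no autowarming on the master -->
  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

  <!-- either remove <autoCommit> from <updateHandler>, or make it time-based only -->

  <maxWarmingSearchers>1</maxWarmingSearchers>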

-Yonik

On Fri, May 30, 2008 at 3:39 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
>
> I started running the test on 2 other machines with similar specs but more
> RAM (4G). One of them now has about 60k docs and still running fine. On the
> other machine, solr died at about 43k docs. A short while before solr died,
> I saw that there were 5 searchers at the same time. Do any of you know why
> would solr create 5 searchers, and if that could cause solr to die? Is there
> any way to prevent this? Also is there a way to totally disable the searcher
> and whether that is a way to optimize the solr master?
>
> I copied the following from the SOLR Statistics page in case it has
> interested info:
>
> name:[EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  caching : true
> numDocs : 42754
> maxDoc : 42754
> readerImpl : MultiSegmentReader
> readerDir :
> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
> indexVersion : 1211702500453
> openedAt : Fri May 30 10:04:15 PDT 2008
> registeredAt : Fri May 30 10:05:05 PDT 2008
>
> name:   [EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  caching : true
> numDocs : 42754
> maxDoc : 42754
> readerImpl : MultiSegmentReader
> readerDir :
> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
> indexVersion : 1211702500453
> openedAt : Fri May 30 10:03:24 PDT 2008
> registeredAt : Fri May 30 10:03:41 PDT 2008
>
> name:   [EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  caching : true
> numDocs : 42675
> maxDoc : 42675
> readerImpl : MultiSegmentReader
> readerDir :
> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
> indexVersion : 1211702500450
> openedAt : Fri May 30 10:00:53 PDT 2008
> registeredAt : Fri May 30 10:01:05 PDT 2008
>
> name:   [EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  caching : true
> numDocs : 42697
> maxDoc : 42697
> readerImpl : MultiSegmentReader
> readerDir :
> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
> indexVersion : 1211702500451
> openedAt : Fri May 30 10:02:20 PDT 2008
> registeredAt : Fri May 30 10:02:22 PDT 2008
>
> name:   [EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  caching : true
> numDocs : 42724
> maxDoc : 42724
> readerImpl : MultiSegmentReader
> readerDir :
> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
> indexVersion : 1211702500452
> openedAt : Fri May 30 10:02:55 PDT 2008
> registeredAt : Fri May 30 10:02:57 PDT 2008
>
> Thank you all so much for your help. I really appreciate it.
>
> -Gaku
>
> Yonik Seeley wrote:
>>
>> It's most likely a
>> 1) hardware issue: bad memory
>>  OR
>> 2) incompatible libraries (most likely libc version for the JVM).
>>
>> If you have another box around, try that.
>>
>> -Yonik
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Solr-indexing-configuration-help-tp17524364p17566612.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Solr indexing configuration help

2008-06-01 Thread Yonik Seeley
On Sun, Jun 1, 2008 at 4:43 AM, Gaku Mak <[EMAIL PROTECTED]> wrote:
> I have tried Yonik's suggestions with the following:
> 1) all autowarming are off
> 2) commented out firstsearch and newsearcher event handlers
> 3) increased autocommit interval to 600 docs and 30 minutes (previously 50
> docs and 5 minutes)

Glad it looks like your memory issues are solved, but I really
wouldn't use "docs" at all as an autocommit criterion; it will just
slow down your full index builds.
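If you do keep autocommit, a time-only setting in solrconfig.xml might look
like this (the interval is only illustrative):

  <autoCommit>
    <maxTime>300000</maxTime> <!-- milliseconds; no maxDocs criterion -->
  </autoCommit>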

-Yonik

> In addition, I updated the java option with the following:
> -d64 -server -Xms2048M -Xmx3072M -XX:-HeapDumpOnOutOfMemoryError
> -XX:+UseSerialGC
>
> Results:
> I'm currently at 100,000 documents now with about 9.0GB index on a quad
> machine with 4GB ram.  The stress test is to add 20 documents every 30
> seconds now.
>
> It seems like the serial GC works better than the other two alternatives
> (-XX:+UseParallelGC or -XX:+UseConcMarkSweepGC) for some reason.  I have not
> seen any OOM since the changes mentioned above (yet).  If others have better
> experience with other GC and know how to configure it properly, please let
> me know because using serial GC just doesn't sound right on a quad machine.
>
> Additional questions:
> Does anyone know how solr/lucene use heap in terms of their generations
> (young vs tenured) on the indexing environment?  If we have this answer, we
> would be able to better configure the young/tenured ratio in the heap.  Any
> help is appreciated!  Thanks!
>
> Now, I'm looking into configuring the slave machines.  Well, that's a
> separate question.
>
>
>
> Yonik Seeley wrote:
>>
>> Some things to try:
>> - turn off autowarming on the master
>> - turn off autocommit, unless you really need it, or change it to be
>> less agressive:  autocommitting every 50 docs is bad if you are
>> rapidly adding documents.
>> - set maxWarmingSearchers to 1 to prevent the buildup of searchers
>>
>> -Yonik
>>
>> On Fri, May 30, 2008 at 3:39 PM, Gaku Mak <[EMAIL PROTECTED]> wrote:
>>>
>>> I started running the test on 2 other machines with similar specs but
>>> more
>>> RAM (4G). One of them now has about 60k docs and still running fine. On
>>> the
>>> other machine, solr died at about 43k docs. A short while before solr
>>> died,
>>> I saw that there were 5 searchers at the same time. Do any of you know
>>> why
>>> would solr create 5 searchers, and if that could cause solr to die? Is
>>> there
>>> any way to prevent this? Also is there a way to totally disable the
>>> searcher
>>> and whether that is a way to optimize the solr master?
>>>
>>> I copied the following from the SOLR Statistics page in case it has
>>> interested info:
>>>
>>> name:[EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42754
>>> maxDoc : 42754
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
>>> indexVersion : 1211702500453
>>> openedAt : Fri May 30 10:04:15 PDT 2008
>>> registeredAt : Fri May 30 10:05:05 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42754
>>> maxDoc : 42754
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
>>> indexVersion : 1211702500453
>>> openedAt : Fri May 30 10:03:24 PDT 2008
>>> registeredAt : Fri May 30 10:03:41 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version:1.0
>>> description:index searcher
>>> stats:  caching : true
>>> numDocs : 42675
>>> maxDoc : 42675
>>> readerImpl : MultiSegmentReader
>>> readerDir :
>>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/solr/data/index
>>> indexVersion : 1211702500450
>>> openedAt : Fri May 30 10:00:53 PDT 2008
>>> registeredAt : Fri May 30 10:01:05 PDT 2008
>>>
>>> name:   [EMAIL PROTECTED] main
>>> class:  org.apache.solr.search.SolrIndexSearcher
>>> version: 

Re: solr slave configuration help

2008-06-01 Thread Yonik Seeley
On Sun, Jun 1, 2008 at 5:20 AM, Gaku Mak <[EMAIL PROTECTED]> wrote:
[...]
> I also have some test script to query against the slave server; however,
> whenever during snapinstall, OOM would occur and the server is not very
> responsive (even with autowarm disabled).  After a while (like couple
> minutes), the server can respond again.  Is this expected?

Not really expected, no.
Is the server unresponsive to a single search request (i.e. it takes a
long time to complete)?
Are you load testing, or just trying single requests?

> I have set the heap size to 1.5GB out of the 2GB physical ram.  Any help is
> appreciated.  Thanks!

Try a smaller heap.
The OS needs memory to cache the Lucene index structures too (Lucene
does very little caching and depends on the OS to do it for good
performance).
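For example (heap sizes illustrative only), starting the example Jetty with

  java -Xmx1024m -Xms512m -jar start.jar

leaves the remaining RAM for the OS to cache index files.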


-Yonik


Re: new user: some questions about parameters and query syntax

2008-06-01 Thread Yonik Seeley
On Sat, May 31, 2008 at 1:03 AM, Chris Hostetter
<[EMAIL PROTECTED]> wrote:
> The only reason *any* existing solr params use comma or white space
> seperated lists is because way, way, WAY back in the day [...]

I dunno...  for something like "fl", it still seems a bit verbose to
list every field separately.
Some of these things feel like trade-offs between ease of readability &
manual typing of URLs vs ease of programmatic manipulation.

-Yonik


Re: large data to index

2008-06-01 Thread Yonik Seeley
On Sun, Jun 1, 2008 at 11:39 AM, Kevin Xiao <[EMAIL PROTECTED]> wrote:
> Hi,
>
> We have large data to index, index size is about 40 G and getting bigger. We 
> are try to figure out a way to speed up indexing.
>
> 1.   Single solr server, multiple indexers, which will speed up document 
> parsing time, but I am not sure if single solr server can handle multiple 
> requests with reasonable performance.
>
> 2.   Multiple solr servers, multiple indexers, which definitely will 
> work, but I am not sure how to combine indexes from different solr server 
> into one.
> Does anyone have any experience about it? Appreciate any points.

If your searching needs are relatively basic (search, sorted facets,
highlighting), you might keep the indexes separate and then use
distributed search to search across them:

http://wiki.apache.org/solr/DistributedSearch
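A sharded request looks roughly like this (host names hypothetical):

  http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=foo&start=0&rows=10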

-Yonik


Re: Newbie Q: searching multiple fields

2008-06-02 Thread Yonik Seeley
Verify that all the fields you want to search on are indexed.
Verify that the query is being correctly built by adding
debugQuery=true to the request

-Yonik


On Mon, Jun 2, 2008 at 1:53 PM, Jon Drukman <[EMAIL PROTECTED]> wrote:
> I am brand new to Solr.  I am trying to get a very simple setup running.
>  I've got just a few fields: name, description, tags.  I am only able to
> search on the default field (name) however.  I tried to set up the dismax
> config to search all the fields, but I never get any results on the other
> fields.  Example doc:
>
> 
>  318
>  Testing the new system
>  Here is the very descriptive description
>  jsd
>  
>  2008-05-16T05:05:10Z
> 
>
> q=system finds this doc.
>
> q=descriptive does not.
>
> q=descriptive&qt=dismax does not
>
> q=descriptive&qt=dismax&qf=description does not
>
> my solrconfig contains:
>
>  
>
> explicit
> 0.01
> 
>name^2 description^1.5 tags^0.8
> 
> 
>name^2 description^2 tags^1
> 
> 100
> *:*
>
>  
>
> What am I missing?
> -jsd-
>
>


Re: slowdown after 15K queries

2008-06-02 Thread Yonik Seeley
On Mon, Jun 2, 2008 at 1:49 PM, Bram de Jong <[EMAIL PROTECTED]> wrote:
> On Mon, Jun 2, 2008 at 10:13 AM, Erick Erickson <[EMAIL PROTECTED]> wrote:
>> But are you sure you're not just masking the problem? That is, your limit
>> may now be 90,000 queries...
>>
>> I always assume this kind of thing is a memory leak somewhere, have you
>> any tools to monitor your memory consumption and see if that's ever-rising?
>
> 1. is it possible to tell Solr to use more cache mem than is allocated
> to the VM? Will this crash Solr? What are the defaults that java runs
> with? This is my config script:
> http://iua-share.upf.edu/svn/nightingale/trunk/sandbox/solr/config/solrconfig.xml

Use smaller caches unless you need bigger ones for some reason.

If you aren't faceting, I'd make the filterCache smaller.

The queryResultCache is definitely too big.  There's almost never a
reason to autowarm that many queries either.  I could see setting the
autowarm count anywhere from 10 to 100.  The query cache itself
probably doesn't need to be larger than a few hundred.

The documentCache is probably too big also, and probably doesn't buy
you much.  I'd set it anywhere from 64 to 1024.
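As a concrete sketch (sizes illustrative, not prescriptive), that might look like:

  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
  <queryResultCache class="solr.LRUCache" size="256" initialSize="256" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>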

-Yonik

> 2. I noticed that if I let the script keep running (after slowdown) it
> would crash eventually, giving an *if I recall correctly* Heap
> Allocation Exception or something like it. In the traceback I saw
> DisMax somewhere mentioned as well. If you want (if it is useful) I
> can run it again until it crashes and give the full stack trace.
>
> 3. after changing the VM mem, I ran it again and it went to well over
> 100K queries without any problems (I think I also noticed less disk
> access - but I can be wrong)
>
>
> I can monitor mem usage through top or through anything you would like
> me to run, ... If there are any particular tests to run let me know!
>
>
>  - Bram
>
> PS:
> http://freesound.iua.upf.edu/blog/ has now more about my adventures with Solr
>


Re: Newbie Q: searching multiple fields

2008-06-02 Thread Yonik Seeley
On Mon, Jun 2, 2008 at 2:55 PM, Jon Drukman <[EMAIL PROTECTED]> wrote:
> Yonik Seeley wrote:
>>
>> Verify all the fields you want to search on indexed
>> Verify that the query is being correctly built by adding
>> debugQuery=true to the request
>
> here is the schema.xml extract:
>
>required="true" />
>   
>   

There is your issue:  type "string" indexes the whole field value as a
single token.
You want type "text" like you have on the name field.
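In schema.xml that's just a change of the type attribute, e.g. (assuming the
example schema's "text" fieldType; remember to re-index afterwards):

  <field name="description" type="text" indexed="true" stored="true"/>
  <field name="tags" type="text" indexed="true" stored="true"/>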

-Yonik


>   
>   
>   
>
> here is the debugQuery output.  i have no idea how to read it:
>
> 
>  
>  0
>  0
>  
>   dismax
>   descriptive
>   1
>  
>  
>  
>  
>  descriptive
>  descriptive
>  +DisjunctionMaxQuery((tags:descriptive^0.8 |
> description:descriptive^1.5 | name:descript^2.0)~0.01)
> DisjunctionMaxQuery((tags:descriptive | description:descriptive^2.0 |
> name:descript^2.0)~0.01)
>  +(tags:descriptive^0.8 |
> description:descriptive^1.5 | name:descript^2.0)~0.01 (tags:descriptive |
> description:descriptive^2.0 | name:descript^2.0)~0.01
>  
>  
>  
>  
> 
>
>


Re: 1.3 DisMax and MoreLikeThis

2008-06-04 Thread Yonik Seeley
On Wed, Jun 4, 2008 at 11:11 AM, Tom Morton <[EMAIL PROTECTED]> wrote:
>   I wanted to use the new dismax support for more like this described in
> SOLR-295  but can't even get
> the new syntax for dismax to work (described in
> SOLR-281).
> Any ideas if this functionality works?
>
> Here's the relevant part of my solr config,
>
>   defType="dismax">

defType is just another parameter and should appear in the defaults
section below.
-Yonik

>
> explicit
> 0.01
> 
>relatedExact^2 genre^0.5
> 
> 100
> *:*
>
>  
>
> Example query:
> http://localhost:13280/solr/genre?indent=on&version=2.2&q=terrence+howard&start=0&rows=10&fl=*%2Cscore&wt=standard&debugQuery=on&explainOther=&hl.fl=
>
> Debug output: (I would expect to see dismax scoring)
>
> 
> 11.151003 = (MATCH) sum of:
>  6.925395 = (MATCH) weight(name:terrence in 63941), product of:
>0.7880709 = queryWeight(name:terrence), product of:
>  10.0431795 = idf(docFreq=234, numDocs=1988249)
>  0.07846827 = queryNorm
>8.787782 = (MATCH) fieldWeight(name:terrence in 63941), product of:
>  1.0 = tf(termFreq(name:terrence)=1)
>  10.0431795 = idf(docFreq=234, numDocs=1988249)
>  0.875 = fieldNorm(field=name, doc=63941)
>  4.2256074 = (MATCH) weight(name:howard in 63941), product of:
>0.6155844 = queryWeight(name:howard), product of:
>  7.84501 = idf(docFreq=2116, numDocs=1988249)
>  0.07846827 = queryNorm
>6.8643837 = (MATCH) fieldWeight(name:howard in 63941), product of:
>  1.0 = tf(termFreq(name:howard)=1)
>  7.84501 = idf(docFreq=2116, numDocs=1988249)
>  0.875 = fieldNorm(field=name, doc=63941)
>
>
> Here's my build info:
> Solr Specification Version: 1.2.2008.06.02.15.21.48
> Solr Implementation Version: 1.3-dev 662524M - tsmorton - 2008-06-02
> 15:21:48
>
> Is this feature now broken or does it look like my config is wrong?
>
> Thanks...Tom
>


Re: How to run Solr on Linux ?

2008-06-05 Thread Yonik Seeley
On Fri, Jun 6, 2008 at 1:13 AM, Akeel <[EMAIL PROTECTED]> wrote:
> I downloaded Solr and successfully run it (by running *java -jar
> start.jar *from
> *example *directory) on windows machine but when i try to run it (in the
> same way as i did on windows) on my linux machine, i get following error:
>
> er.dir=/solr/example
> 2008-06-05 17:36:17.343::WARN:  failed SolrRequestFilter
> java.lang.NoClassDefFoundError: org.apache.solr.core.SolrCore
>   at java.lang.Class.initializeClass(libgcj.so.7rh)

Your best bet is to try downloading and using the JVM from Sun.

-Yonik


Re: boost ignored with wildcard queries

2008-06-06 Thread Yonik Seeley
On Fri, Jun 6, 2008 at 5:16 PM, David Smiley @MITRE.org
<[EMAIL PROTECTED]> wrote:
> Curious... Why is ConstantScoreQuery only applied to prefix queries?  Your
> rationale suggests that it is also applicable wildcard query and fuzzy query
> too (basically any place an analyzer isn't used).

I think fuzzy queries may have been fixed in lucene to not exceed the
boolean query clause limit.
WildCard queries: no good reason... didn't really need it, so I never
got around to it :-)

-Yonik

> ~ David Smiley
>
>
> Yonik Seeley wrote:
>>
>> On Tue, Feb 26, 2008 at 7:23 PM, Head <[EMAIL PROTECTED]> wrote:
>>>
>>>  Using the StandardRequestHandler, it appears that the index boost values
>>> are
>>>  ignored when the query has a wildcard in it.   For example, if I have 2
>>>  's and one has a boost of 1.0 and another has a boost of 10.0, then
>>> I
>>>  do a search for "bob*", both records will be returned with the same
>>> score of
>>>  1.0.   If I just do a normal search then the  that has the higher
>>> boost
>>>  has the higher score as expected.
>>>
>>>  Is this a bug?
>>
>> A feature :-)
>> Solr uses ConstantScoreRangeQuery and ConstantScorePrefixQuery to
>> avoid getting exceptions from too many terms.
>>
>> -Yonik
>>
>>
>>>  ~Tom
>>>
>>>  p.s. Here's what my debug looks like:
>>>
>>>  
>>>  1.0 = (MATCH)
>>>  ConstantScoreQuery([EMAIL PROTECTED]), product
>>> of:
>>>   1.0 = boost
>>>   1.0 = queryNorm
>>>  
>>>  
>>>  1.0 = (MATCH)
>>>  ConstantScoreQuery([EMAIL PROTECTED]), product
>>> of:
>>>   1.0 = boost
>>>   1.0 = queryNorm
>>>  
>>>  --
>>>  View this message in context:
>>> http://www.nabble.com/boost-ignored-with-wildcard-queries-tp15703334p15703334.html
>>>  Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/boost-ignored-with-wildcard-queries-tp15703334p17701306.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Improve Solr Performance

2008-06-08 Thread Yonik Seeley
Some of these cache values are too large and will drastically slow
some things down (like commiting new changes to the index) or may
cause you to run out of memory over time.  I would revert the cache
params back to what they were in the example solrconfig.xml

Then focus on requirements: are your query times too high for some
specific queries that you are actually going to use in production?  If
so, show us what the actual query looks like and we can see if there
is a way to speed them up.

-Yonik



On Sun, Jun 8, 2008 at 11:36 AM, khirb7 <[EMAIL PROTECTED]> wrote:
> thank you for your response
> It's clear that fetching all documents at one time takes a lot of time, so
> the search is too slow. I decided to fetch fewer than 20 documents at one
> time, but to optimise Solr I have tried increasing the size of all the
> caches (the document cache, filterCache, result cache) to 8912 and the value
> of the autoWarm param to 4096. I have also increased the value of
> queryResultWindowSize from 10 to 1000, set the OmitNorms param of some of my
> fields to true, and used the fq parameter to apply a filter instead of
> building the filter as a boolean clause in the q parameter.
> In the end I noticed that these changes don't make the QTime lower; it
> actually becomes greater than before applying the changes, except that
> setting OmitNorms to true for three fields did lower the QTime.
>
> How should I deal with that? Does someone have advice, or can you explain
> what the appropriate values are for the parameters I mentioned above to
> improve QTime?
>
> In addition, I just want to know whether the value given to OmitNorms only
> has an effect at search time or also at indexing time.
>
> thank you in advance.


Re: searching only within allowed documents

2008-06-10 Thread Yonik Seeley
On Mon, Jun 9, 2008 at 7:44 PM, Stephen Weiss <[EMAIL PROTECTED]> wrote:
> However, in the plain text search, the user automatically searches through
> *all* of the folders to which they have subscribed.  This means, for (good!)
> users who have subscribed to a large (1000+) number of folders, the filter
> query would be quite long,

This is not a well-solved problem in Lucene & Solr in general.

> and would exceed the default number of boolean
> parameters allowed.

Solr allows you to specify filters in separate parameters that are
applied to the main query, but cached separately.

q=the user query&fq=folder:f13&fq=folder:f24

The other option is to have a user field and index the users that have
access to the specific document.  The downside to this is that the
document must be re-indexed to reflect permission changes (like a new
user that now has access to it).  This may or may not be feasible,
depending on how many users you have to support and how fast
permissions must change.
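A sketch of that second approach (field and user names are hypothetical):

  <!-- schema.xml: one entry per user allowed to see the document -->
  <field name="allowed_user" type="string" indexed="true" stored="false" multiValued="true"/>

and then each request adds a filter such as

  q=the user query&fq=allowed_user:jsmith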

> Now, I'm reading on this tutorial page for Lucene:
>  http://www.lucenetutorial.com/techniques/permission-filtering.html that the
> best way to do this would involve some combination of HitCollector &
> FieldCache.  From what the author is saying, this sounds like exactly what I
> need.  Unfortunately, I am almost completely Java-illiterate, and on top of
> that, I'm  not really finding any explanation of:
>
> a) What exactly I would do with the HitCollector & FieldCache objects that
> would help me achieve this goal - even just at the level of Lucene, there's
> no real explanation in the tutorial
> or

I think he's saying that with the FieldCache, you can get the external
String id of each matching document and then through some other
external mechanism, determine if that document should be allowed.  So
that still leaves that application-specific part to be solved.

> b) Where exactly these classes fit in to Solr (if they do at all)

A custom request handler or a custom query component would be the
likely place to add/change behavior.

> So far I have already written my own (tiny, tiny) Tokenizer and
> TokenizerFactory for correctly parsing the tags that come in from the
> database, and that works great,

What's the format of the tags... you might be able to use an existing
tokenizer (a regex one perhaps).
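For example, if the tags arrive as a comma-separated string, a fieldType along
these lines (the pattern is only a guess at your format) could replace a custom
tokenizer:

  <fieldType name="tags" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
    </analyzer>
  </fieldType>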

-Yonik


Re: range query highlighting

2008-06-11 Thread Yonik Seeley
It's a known deficiency... ConstantScoreRangeQuery and
ConstantScorePrefixQuery (which Solr uses) rewrite to a
ConstantScoreQuery and don't expose the terms they match.
Performance-wise it seems like a bad idea if the number of terms
matched is large (esp when used in a MultiSearcher or later in
global-idf for distributed search).

-Yonik

On Wed, Jun 11, 2008 at 11:09 AM, Stefan Oestreicher
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using solr built from trunk and highlighting for range queries doesn't
> work.
> If I search for "2008" everything works as expected but if I search for
> "[2000 TO 2008]" nothing gets highlighted.
> The field I'm searching on is a TextField and I've confirmed that the query
> and index analyzers are working as expected.
> I didn't find anything in the issue tracker about this.
>
> Any ideas?
>
> TIA,
>
> Stefan Oestreicher
>
> --
> Dr. Maté GmbH
> Stefan Oestreicher / Entwicklung
> [EMAIL PROTECTED]
> http://www.netdoktor.at
> Tel Buero: + 43 1 405 55 75 24
> Fax Buero: + 43 1 405 55 75 55
> Alser Str. 4 1090 Wien Altes AKH Hof 1 1.6.6
>
>


Re: Question about fieldNorm

2008-06-11 Thread Yonik Seeley
That is strange... did you re-index or change the index?  If so, you
might want to verify that docid=3454 still corresponds to the same
document you queried earlier.

-Yonik


On Wed, Jun 11, 2008 at 1:09 PM, Brendan Grainger
<[EMAIL PROTECTED]> wrote:
> I've just changed the stemming algorithm slightly and am running a few tests
> against the old stemmer versus the new stemmer. I did a query for 'hanger'
> and using the old stemmer I get the following scoring for a document with
> the title: Converter Hanger Assembly Replacement
>
> 6.4242806 = (MATCH) sum of:
>  2.5697122 = (MATCH) max of:
>0.2439919 = (MATCH) weight(markup_t:hanger in 3454), product of:
>  0.1963516 = queryWeight(markup_t:hanger), product of:
>6.5593724 = idf(docFreq=6375, numDocs=1655591)
>0.02993451 = queryNorm
>  1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>1.7320508 = tf(termFreq(markup_t:hanger)=3)
>6.5593724 = idf(docFreq=6375, numDocs=1655591)
>0.109375 = fieldNorm(field=markup_t, doc=3454)
>2.5697122 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
>  0.5547002 = queryWeight(title_t:hanger^2.0), product of:
>2.0 = boost
>9.265229 = idf(docFreq=425, numDocs=1655591)
>0.02993451 = queryNorm
>  4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>1.0 = tf(termFreq(title_t:hanger)=1)
>9.265229 = idf(docFreq=425, numDocs=1655591)
>0.5 = fieldNorm(field=title_t, doc=3454)
>  3.8545685 = (MATCH) max of:
>0.12199595 = (MATCH) weight(markup_t:hanger^0.5 in 3454), product of:
>  0.0981758 = queryWeight(markup_t:hanger^0.5), product of:
>0.5 = boost
>6.5593724 = idf(docFreq=6375, numDocs=1655591)
>0.02993451 = queryNorm
>  1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>1.7320508 = tf(termFreq(markup_t:hanger)=3)
>6.5593724 = idf(docFreq=6375, numDocs=1655591)
>0.109375 = fieldNorm(field=markup_t, doc=3454)
>3.8545685 = (MATCH) weight(title_t:hanger^3.0 in 3454), product of:
>  0.8320503 = queryWeight(title_t:hanger^3.0), product of:
>3.0 = boost
>9.265229 = idf(docFreq=425, numDocs=1655591)
>0.02993451 = queryNorm
>  4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>1.0 = tf(termFreq(title_t:hanger)=1)
>9.265229 = idf(docFreq=425, numDocs=1655591)
>0.5 = fieldNorm(field=title_t, doc=3454)
>
> Using the new stemmer I get:
>
> 5.621245 = (MATCH) sum of:
>  2.248498 = (MATCH) max of:
>0.24399184 = (MATCH) weight(markup_t:hanger in 3454), product of:
>  0.19635157 = queryWeight(markup_t:hanger), product of:
>6.559371 = idf(docFreq=6375, numDocs=1655589)
>0.029934512 = queryNorm
>  1.2426274 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>1.7320508 = tf(termFreq(markup_t:hanger)=3)
>6.559371 = idf(docFreq=6375, numDocs=1655589)
>0.109375 = fieldNorm(field=markup_t, doc=3454)
>2.248498 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
>  0.5547002 = queryWeight(title_t:hanger^2.0), product of:
>2.0 = boost
>9.265228 = idf(docFreq=425, numDocs=1655589)
>0.029934512 = queryNorm
>  4.0535374 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>1.0 = tf(termFreq(title_t:hanger)=1)
>9.265228 = idf(docFreq=425, numDocs=1655589)
>0.4375 = fieldNorm(field=title_t, doc=3454)
>  3.372747 = (MATCH) max of:
>0.12199592 = (MATCH) weight(markup_t:hanger^0.5 in 3454), product of:
>  0.09817579 = queryWeight(markup_t:hanger^0.5), product of:
>0.5 = boost
>6.559371 = idf(docFreq=6375, numDocs=1655589)
>0.029934512 = queryNorm
>  1.2426274 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>1.7320508 = tf(termFreq(markup_t:hanger)=3)
>6.559371 = idf(docFreq=6375, numDocs=1655589)
>0.109375 = fieldNorm(field=markup_t, doc=3454)
>3.372747 = (MATCH) weight(title_t:hanger^3.0 in 3454), product of:
>  0.83205026 = queryWeight(title_t:hanger^3.0), product of:
>3.0 = boost
>9.265228 = idf(docFreq=425, numDocs=1655589)
>0.029934512 = queryNorm
>  4.0535374 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>1.0 = tf(termFreq(title_t:hanger)=1)
>9.265228 = idf(docFreq=425, numDocs=1655589)
>0.4375 = fieldNorm(field=title_t, doc=3454)
>
> The thing that is perplexing is that the fieldNorm for the title_t field is
> different in each of the explanations, ie: the fieldNorm using the old
> stemmer is: 0.5 = fieldNorm(field=title_t, doc=3454). For the new stemmer
>  0.4375 = fieldNorm(field=title_t, doc=3454). I ran the title through both
> stemmers and get the same number of tokens produced. I do no index time
> boosting on the title_t field. I am using DefaultSimilarity in both
> i

Re: Question about fieldNorm

2008-06-11 Thread Yonik Seeley
Field norms have limited precision (each norm is encoded as an 8-bit float), so
you are probably seeing rounding.
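With the synonym expansion the title analyzes to 5 terms, so the raw lengthNorm
is 1/sqrt(5) ≈ 0.4472, and the 8-bit encoding stores the nearest representable
value below that, 0.4375.  A quick way to see the quantization (Lucene 2.x API;
a throwaway check, not production code):

  public class NormCheck {
    public static void main(String[] args) {
      float raw = (float) (1.0 / Math.sqrt(5));   // 0.4472136
      byte enc = org.apache.lucene.search.Similarity.encodeNorm(raw);
      float stored = org.apache.lucene.search.Similarity.decodeNorm(enc);
      System.out.println(raw + " -> " + stored);  // prints: 0.4472136 -> 0.4375
    }
  }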

-Yonik

On Wed, Jun 11, 2008 at 2:13 PM, Brendan Grainger
<[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> I just realized that the stemmer does make a difference because of synonyms.
> So on indexing using the new stemmer "converter hanger assembly replacement"
> gets expanded to: "converter hanger assembly assemble replacement" so there
> are 5 terms which gets a length norm of 0.4472136 instead of 0.5. Still
> unsure how it gets 0.4375 though as the result for the field norm though
> unless I have a boost of 0.9783 somewhere there.
>
> Brendan
>
>
> On Jun 11, 2008, at 1:37 PM, Yonik Seeley wrote:
>
>> That is strange... did you re-index or change the index?  If so, you
>> might want to verify that docid=3454 still corresponds to the same
>> document you queried earlier.
>>
>> -Yonik
>>
>>
>> On Wed, Jun 11, 2008 at 1:09 PM, Brendan Grainger
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> I've just changed the stemming algorithm slightly and am running a few
>>> tests
>>> against the old stemmer versus the new stemmer. I did a query for
>>> 'hanger'
>>> and using the old stemmer I get the following scoring for a document with
>>> the title: Converter Hanger Assembly Replacement
>>>
>>> 6.4242806 = (MATCH) sum of:
>>> 2.5697122 = (MATCH) max of:
>>>  0.2439919 = (MATCH) weight(markup_t:hanger in 3454), product of:
>>>0.1963516 = queryWeight(markup_t:hanger), product of:
>>>  6.5593724 = idf(docFreq=6375, numDocs=1655591)
>>>  0.02993451 = queryNorm
>>>1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>>>  1.7320508 = tf(termFreq(markup_t:hanger)=3)
>>>  6.5593724 = idf(docFreq=6375, numDocs=1655591)
>>>  0.109375 = fieldNorm(field=markup_t, doc=3454)
>>>  2.5697122 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
>>>0.5547002 = queryWeight(title_t:hanger^2.0), product of:
>>>  2.0 = boost
>>>  9.265229 = idf(docFreq=425, numDocs=1655591)
>>>  0.02993451 = queryNorm
>>>4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>>>  1.0 = tf(termFreq(title_t:hanger)=1)
>>>  9.265229 = idf(docFreq=425, numDocs=1655591)
>>>  0.5 = fieldNorm(field=title_t, doc=3454)
>>> 3.8545685 = (MATCH) max of:
>>>  0.12199595 = (MATCH) weight(markup_t:hanger^0.5 in 3454), product of:
>>>0.0981758 = queryWeight(markup_t:hanger^0.5), product of:
>>>  0.5 = boost
>>>  6.5593724 = idf(docFreq=6375, numDocs=1655591)
>>>  0.02993451 = queryNorm
>>>1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>>>  1.7320508 = tf(termFreq(markup_t:hanger)=3)
>>>  6.5593724 = idf(docFreq=6375, numDocs=1655591)
>>>  0.109375 = fieldNorm(field=markup_t, doc=3454)
>>>  3.8545685 = (MATCH) weight(title_t:hanger^3.0 in 3454), product of:
>>>0.8320503 = queryWeight(title_t:hanger^3.0), product of:
>>>  3.0 = boost
>>>  9.265229 = idf(docFreq=425, numDocs=1655591)
>>>  0.02993451 = queryNorm
>>>4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>>>  1.0 = tf(termFreq(title_t:hanger)=1)
>>>  9.265229 = idf(docFreq=425, numDocs=1655591)
>>>  0.5 = fieldNorm(field=title_t, doc=3454)
>>>
>>> Using the new stemmer I get:
>>>
>>> 5.621245 = (MATCH) sum of:
>>> 2.248498 = (MATCH) max of:
>>>  0.24399184 = (MATCH) weight(markup_t:hanger in 3454), product of:
>>>0.19635157 = queryWeight(markup_t:hanger), product of:
>>>  6.559371 = idf(docFreq=6375, numDocs=1655589)
>>>  0.029934512 = queryNorm
>>>1.2426274 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
>>>  1.7320508 = tf(termFreq(markup_t:hanger)=3)
>>>  6.559371 = idf(docFreq=6375, numDocs=1655589)
>>>  0.109375 = fieldNorm(field=markup_t, doc=3454)
>>>  2.248498 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
>>>0.5547002 = queryWeight(title_t:hanger^2.0), product of:
>>>  2.0 = boost
>>>  9.265228 = idf(docFreq=425, numDocs=1655589)
>>>  0.029934512 = queryNorm
>>>4.0535374 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
>>>  1.0 = tf(termFreq(title_t:hanger)=1)
>>>  9.265228 = idf(doc

Re: Problem with add a XML

2008-06-12 Thread Yonik Seeley
You need to define fields in the schema.xml (and otherwise change the
schema to match your data).
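Every element in the posted document needs a corresponding declaration, e.g.
(field names and types here are only placeholders):

  <field name="docnumber" type="string" indexed="true" stored="true"/>

or a matching dynamicField pattern such as

  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>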
-Yonik

On Wed, Jun 11, 2008 at 3:46 AM, Thomas Lauer <[EMAIL PROTECTED]> wrote:
> 
> 
>  
>85f4fdf9-e596-4974-a5b9-57778e38067b
>143885
>28.10.2005 13:06:15
>Rechnung 2005-025235
>Rechnungsduplikate
>2002
>330T.doc
>KIS
>Bonow
>25906
>Hofma GmbH
>Mandant
>  
> 


Re: Re: Analytics e.g. "Top 10 searches"

2008-06-12 Thread Yonik Seeley
On Thu, Jun 12, 2008 at 3:04 PM, Shalin Shekhar Mangar
<[EMAIL PROTECTED]> wrote:
> Just as a thought, would it be possible to expose the original query text
> from the QueryResultCache keys (Query) somehow? If that is possible, it
> would allow us to query the top N most frequent queries anytime for
> reasonable values of N.

That would only give most recent, not most frequent.

-Yonik


Re: Problem with add a XML

2008-06-12 Thread Yonik Seeley
That can happen if the JVM died or got a critical error.
You can remove the lock file manually, or configure Solr to remove it
automatically on startup (see solrconfig.xml).
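The setting lives in the <mainIndex> section of the example solrconfig.xml:

  <unlockOnStartup>true</unlockOnStartup> <!-- remove a stale write.lock when Solr starts -->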

-Yonik

On Thu, Jun 12, 2008 at 3:57 PM, Thomas Lauer <[EMAIL PROTECTED]> wrote:
>
> This is the error message from the console.
>
> SCHWERWIEGEND: org.apache.lucene.store.LockObtainFailedException: Lock
> obtain timed out: [EMAIL PROTECTED]:\Dokumente und E
> instellungen\tla\Desktop\solr\apache-solr-1.2.0\apache-solr-1.2.0\example\solr\data\index\write.lock
>at org.apache.lucene.store.Lock.obtain(Lock.java:70)
>at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:579)
>at org.apache.lucene.index.IndexWriter.(IndexWriter.java:341)
>at
> org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:65)
>at
> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:120)
>at
> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:181)
>at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:259)
>at
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
>at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
>at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>at org.mortbay.jetty.Server.handle(Server.java:285)
>at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>
>
>
>
>
> Jón Helgi Jónsson wrote:
>>
>> Usually you get better error messages from the start.jar console, you
>> don't see anything there?
>>
>> On Thu, Jun 12, 2008 at 7:49 AM, Thomas Lauer <[EMAIL PROTECTED]> wrote:
>>>
>>> Yes my file is UTF-8. I Have Upload my file.
>>>
>>>
>>>
>>>
>>> Grant Ingersoll-6 wrote:


 On Jun 11, 2008, at 3:46 AM, Thomas Lauer wrote:

> now I want tho add die files to solr. I have start solr on windows
> in the example directory with java -jar start.jar
>
>
> I have the following Error Message:
>
> C:\test\output>java -jar post.jar *.xml
> SimplePostTool: version 1.2
> SimplePostTool: WARNING: Make sure your XML documents are encoded in
> UTF-8, other encodings are not currently supported


 This is your issue right here.  You have to save that second file in
 UTF-8.

>
> SimplePostTool: POSTing files to http://localhost:8983/solr/update..
> SimplePostTool: POSTing file 1.xml
> SimplePostTool: POSTing file 2.xml
> SimplePostTool: FATAL: Connection error (is Solr running at
> http://localhost:8983/solr/update
>  ?): java.io.IOException: S
> erver returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/update
>
> C:\test\output>
>
> Regards Thomas Lauer
>
>
>
>
>
> __ Hinweis von ESET NOD32 Antivirus, Signaturdatenbank-
> Version 3175 (20080611) __
>
> E-Mail wurde geprüft mit ESET NOD32 Antivirus.
>
> http://www.eset.com

 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ






>

Re: Memory problems when highlight with not very big index

2008-06-13 Thread Yonik Seeley
On Fri, Jun 13, 2008 at 1:07 PM, Roberto Nieto <[EMAIL PROTECTED]> wrote:
> It´s possible to only
> allocate the 10 first results to make the snippet of only those results and
> use less memory?

That's how it currently works.

But there is a Document cache to make things more efficient.
If you have large documents, you might want to decrease this from its
default size (see solrconfig.xml) which is currently 512.  Perhaps
move it down to 60 (which would allow for 6 concurrent requests of 10
docs each w/o re-fetching the doc between highlighting and response
writing).
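For example (a sketch; sizes illustrative):

  <documentCache class="solr.LRUCache" size="60" initialSize="60" autowarmCount="0"/>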

-Yonik


Re: Memory problems when highlight with not very big index

2008-06-13 Thread Yonik Seeley
On Fri, Jun 13, 2008 at 3:30 PM, Roberto Nieto <[EMAIL PROTECTED]> wrote:
> The part that I can't understand very well is why, if I deactivate
> highlighting, the memory doesn't grow.
> Does it only use the doc cache if highlighting is used or if content
> retrieval is activated?

Perhaps you are highlighting some fields that you normally don't
return?  What is "fl" vs "hl.fl"?

-Yonik


Re: Dismax + Dynamic fields

2008-06-16 Thread Yonik Seeley
On Mon, Jun 16, 2008 at 10:46 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> I just wanted to confirm that dynamic fields cannot be used with dismax

There are two levels of dynamic field support.

Specific dynamic fields can be queried with dismax, but you can't
wildcard the "qf" or other field parameters.
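So, with hypothetical concrete field names, a qf like

  field1^10.0 dyn_1_title^5.0 dyn_1_body^5.0

works, but dyn_1_*^5.0 is not expanded.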

-Yonik

> By this I mean that the following :
>
> schema.xml
> [...]
>stored="true" required="false" />
> [..]
>
> solrconfig.xml
> [..]
>  
>
>
> explicit
> 0.01
> 
> 
>field1^10.0 dyn_1_*^5.0
> 
> [...]
>
> will never take dyn_1_* fields into consideration when searching. I've 
> confirmed it with some tests, but maybe I'm missing something.
>
> From what I've read in some emails, it seems like this, but I haven't been 
> able to find a direct reference to this.
>
> TIA!
> B
>
> _
> {Beto|Norberto|Numard} Meijome
>
> Q. How do you make God laugh?
> A. Tell him your plans.
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.
>


Re: Adding records during a commit

2008-06-16 Thread Yonik Seeley
No records should be dropped, regardless of whether a commit or optimize is going on.
Are you checking the return codes (HTTP return codes for Solr 1.3)?
Some updates could be failing for some reason.
Also grep for "Exception" in the solr log file.

-Yonik

On Mon, Jun 16, 2008 at 4:02 PM, dls1138 <[EMAIL PROTECTED]> wrote:
>
> I've been sending data in batches to Solr with no errors reported, yet after
> a commit, over 50% of the records I added (before the commit) do not show
> up- even after several subsequent commits down the road.
>
> Is it possible that Solr/Lucene could be disregarding or dropping my add
> queries if those queries were executed while a commit was running?
>
> For example, if I add 300 records, and then do a commit- during the 10-20
> seconds for the commit to execute (on an index over 1.2M records), if I add
> 100 more records during that 10-20 second time period, are those adds lost?
> I'm assuming they are not and will be visible after the next commit, but I
> want to be sure as it seems that some are being dropped. I just need to know
> if this can happen during commits or if I should be looking elsewhere to
> resolve my dropped record problem.
>
> Thanks.
>
>
> --
> View this message in context: 
> http://www.nabble.com/Adding-records-during-a-commit-tp17872257p17872257.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Adding records during a commit

2008-06-16 Thread Yonik Seeley
On Mon, Jun 16, 2008 at 6:07 PM, dls1138 <[EMAIL PROTECTED]> wrote:
> I'm getting all 200 return codes from Solr on all of my batches.

IIRC, Solr 1.2 uses the update servlet and always returns 200 (you need
to look at the response body to see if there was an error or not).

> I skimmed the logs for errors, but I didn't try to grep for "Exception". I
> will take your advice look there for some clues.
>
> Incidentally I'm running solr 1.2 using Jetty. I'm not on 1.3 because I read
> it wasn't released yet. Is there a (more stable than 1.2) branch of 1.3 I
> should be using instead?

If you aren't going to go into production for another month or so, I'd
start using 1.3
Start off with a new solrconfig.xml from 1.3 and re-make any
customizations to make sure you get the latest behavior.

> I know 1.2 is obviously dated, and came packaged with an old version of
> Lucene. Should I update either or both?

Solr takes care of updating Lucene for you... I wouldn't recommend
changing the version of Lucene independent of Solr unless you are
pretty experienced in Lucene.

-Yonik

>
>
>
>
> Yonik Seeley wrote:
>>
>> No records should be dropped, regardless of if a commit or optimize is
>> going on.
>> Are you checking the return codes (HTTP return codes for Solr 1.3)?
>> Some updates could be failing for some reason.
>> Also grep for "Exception" in the solr log file.
>>
>> -Yonik
>>
>> On Mon, Jun 16, 2008 at 4:02 PM, dls1138 <[EMAIL PROTECTED]> wrote:
>>>
>>> I've been sending data in batches to Solr with no errors reported, yet
>>> after
>>> a commit, over 50% of the records I added (before the commit) do not show
>>> up- even after several subsequent commits down the road.
>>>
>>> Is it possible that Solr/Lucene could be disregarding or dropping my add
>>> queries if those queries were executed while a commit was running?
>>>
>>> For example, if I add 300 records, and then do a commit- during the 10-20
>>> seconds for the commit to execute (on an index over 1.2M records), if I
>>> add
>>> 100 more records during that 10-20 second time period, are those adds
>>> lost?
>>> I'm assuming they are not and will be visible after the next commit, but
>>> I
>>> want to be sure as it seems that some are being dropped. I just need to
>>> know
>>> if this can happen during commits or if I should be looking elsewhere to
>>> resolve my dropped record problem.
>>>
>>> Thanks.
>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Adding-records-during-a-commit-tp17872257p17872257.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Adding-records-during-a-commit-tp17872257p17874274.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Dismax + Dynamic fields

2008-06-17 Thread Yonik Seeley
On Tue, Jun 17, 2008 at 3:36 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> On Mon, 16 Jun 2008 14:22:12 -0400
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
>
>> There are two levels of dynamic field support.
>>
>> Specific dynamic fields can be queried with dismax, but you can't
>> wildcard the "qf" or other field parameters.
>
> Thanks Yonik. ok, that matches what I've seen - if i know the actual name of 
> the field I'm after, I can use it in a query it, but i can't use the 
> dynamic_field_name_* (with wildcard) in the config.
>
> Is adding support for this something that is desirable / needed (doable??) , 
> and is it being worked on ?

It does make sense in certain scenarios, but I don't think anyone is
working on it.

-Yonik


Re: Default result rows

2008-06-18 Thread Yonik Seeley
2008/6/18 Mihails Agafonovs <[EMAIL PROTECTED]>:
> Doesn't work :(. None of the parameters in the "defaults" section is
> being read.

Everyone uses this functionality, so it's a bug in your request or
config somewhere.
- Make sure you restarted Solr after changing solrconfig.xml
- Make sure you changed the defaults in the right request handler
- Add echoParams=all to your request to see what parameters are being used
- if you can't get it to work, post the URL of the query you are using
to test, the response output, and the relevant part of the
solrconfig.xml
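For example, a defaults section that sets the row count, plus a request that
echoes the parameters actually in effect (handler declaration follows the stock
solrconfig.xml):

  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
    <lst name="defaults">
      <int name="rows">20</int>
    </lst>
  </requestHandler>

  http://localhost:8983/solr/select?q=*:*&echoParams=all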

-Yonik


Re: scaling / sharding questions

2008-06-18 Thread Yonik Seeley
On Wed, Jun 18, 2008 at 5:53 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:
> Does this mean that the Lucene scoring algorithm is computed without the idf
> factor, i.e. we just get term frequency scoring?

No, it means that the idf calculation is done locally on a single shard.
With a big index that is randomly mixed, this should not have a
practical impact.

> 2) Doesn't support consistency between stages, e.g. a shard index can be
> changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS
>
> What does this mean or where can I find out what it means?

STAGE_EXECUTE_QUERY finds the ids of matching documents.
STAGE_GET_FIELDS retrieves the fields of matching documents.

A change to a document could possibly happen inbetween, and one would
end up retrieving a document that no longer matched the query.  In
practice, this is rarely an issue.

-Yonik
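
For context, a distributed request in 1.3 is just a normal query with a shards
parameter listing the cores whose results should be merged (the hosts below are
placeholders):

  http://localhost:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=fish&rows=10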


Re: "Did you mean" functionality

2008-06-19 Thread Yonik Seeley
On Thu, Jun 19, 2008 at 2:07 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Is there any work being done on getting this into SolrJ at the moment?

Just a note to those who may be new to SolrJ: you can still access new
or custom functionality in a generic way via getResponse() w/o
explicit SolrJ support.

-Yonik
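
A minimal SolrJ sketch of that generic access (assumes the request handler has
the spell check component configured, so a "spellcheck" section comes back in
the response):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class SpellcheckRaw {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("q", "hte quick borwn fox");
      params.set("spellcheck", "true");

      QueryResponse rsp = server.query(params);
      // getResponse() exposes the raw NamedList, so sections SolrJ has no
      // typed accessor for yet (like "spellcheck") are still reachable
      NamedList<Object> raw = rsp.getResponse();
      System.out.println(raw.get("spellcheck"));
    }
  }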


Re: Solr performance issues

2008-06-19 Thread Yonik Seeley
On Thu, Jun 19, 2008 at 6:11 PM, Sébastien Rainville
<[EMAIL PROTECTED]> wrote:
> I've been using solr for a little without worrying too much about how it
> works but now it's becoming a bottleneck in my application. I have a couple
> issues with it:
>
> 1. My index always gets slower and slower when commiting/optimizing for some
> obscure reason. It goes from 1 second with a new index to 45 seconds with an
> index with the same amount of data but used for a few days. Restarting solr
> doesn't fix it. The only way I found to fix that is to delete the whole
> index completely by deleting the index folder. Then when I rebuild the index
> everything goes back to normal and fast... and then performance slowly
> deteriorates again. So, the amount of data is not a factor because
> rebuilding the index from scratch fixes the problem and I am sending
> "optimize" once in a while... even maybe too often.

This sounds like OS caching to me.  A large amount of a "new" index
that was just written will be in cache and thus much faster to
optimize.

If your index is smaller than the amount of RAM, go to the index
directory of an "old" index, then try "cat * > /dev/null" and then try
optimize to see of that's the case.

> 2. I use acts_as_solr and by default they only make "post" requests, even
> for /select. With that setup the response time for most queries, simple or
> complex ones, were ranging from 150ms to 600ms, with an average of 250ms. I
> changed the select request to use "get" requests instead and now the
> response time is down to 10ms to 60ms. Did someone seen that before? Why is
> it doing it?

Are the get requests being cached by the ruby stuff?

But even with no caching, I've seen differences with get/post on Linux
with the python client when persistent HTTP connections were in use.
I tracked it down to the POST being written in two parts, triggering
nagle's algorithm in the networking stack.

-Yonik


Re: Lucene 2.4-dev source ?

2008-06-25 Thread Yonik Seeley
trunk is the latest version (which is currently 2.4-dev).
http://svn.apache.org/viewvc/lucene/java/trunk/

There is a contrib directory with things not in lucene-core:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/

-Yonik


Re: Solr 1.3 deletes not working?

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 8:44 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote:
> I'm having trouble deleting documents from my solr 1.3 index.  To delete a
> document, I post something like "<delete><id>12345</id></delete>" to the
> solr server, then issue a commit.  However, I can still find the document in
> the index via the query "id:12345".

That's strange  there are unit tests for this, and I just verified
it worked on the example data.
Perhaps the schema no longer matches what you indexed (or did you re-index?)
Make sure the uniqueKeyField specifies "id".

>  The document remains visible even after
> I restart the solr server.  I know the server is receiving my delete
> commands, since deletesById goes up on the stats page, but docsDeleted stays
> at 0.

docsDeleted is no longer tracked since Lucene now handles the document
overwriting itself.
It should probably be removed.

-Yonik


Re: Sorting questions

2008-06-25 Thread Yonik Seeley
It's not exactly what you want, but putting specific documents first
for certain queries has been done via
http://wiki.apache.org/solr/QueryElevationComponent

-Yonik

On Wed, Jun 25, 2008 at 6:58 PM, Yugang Hu <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have the same issue as described in:
> http://www.nabble.com/solr-sorting-question-td17498596.html. I am trying to
> have some categories before others in search results for different search
> terms. For example, for search team "ABC", I want to show Category "CCC"
> first, then Category "BBB", "AAA", "DDD" and for search team "CBA", I
> want to show Category "DDD" first, then Category "CCC", "AAA", "BBB"...
> Is this possible in Solr? Has someone done this before?
>
> Any help will be appreciated.
>
> Thanks,
>
> Yugang
>
>
>
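
For reference, QueryElevationComponent is driven by an elevate.xml file along
these lines (query text and document ids below are made up to match the example
in the question):

  <elevate>
    <query text="ABC">
      <doc id="ccc-doc-1" />
      <doc id="ccc-doc-2" />
    </query>
    <query text="CBA">
      <doc id="ddd-doc-1" />
    </query>
  </elevate>

It pins specific documents to the top for specific query strings, which is
close to, but not the same as, ordering whole categories per search term.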


Re: Solr 1.3 deletes not working?

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 9:34 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote:
> I originally tested with an index generated by solr 1.2, but when that
> didn't work, I rebuilt the index from scratch.
> From my schema.xml:
>
> <fields>
>   ...
>   <field name="id" type="integer" indexed="true" stored="true" required="true"/>
>   ...
> </fields>
>
> <uniqueKey>id</uniqueKey>

I tried this as well... changing the example schema id type to
integer, adding a document and deleting it.  Everything worked fine.

Something to watch out for: when you indexed the data, could it have
had spaces in the id or something?

If you can't figure it out, try reproducing it in a simple example
that can be added to a JIRA issue.

-Yonik


Re: Slow deleteById request

2008-07-01 Thread Yonik Seeley
That's very strange... are you sending a commit with the delete perhaps?
If so, the whole request would block until a new searcher is registered.

-Yonik

On Tue, Jul 1, 2008 at 8:54 AM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
> Hi,
>
> We experience very slow delete, taking more than 10 seconds. A delete is
> executed using deleteById (from Solrj or from curl), at the same time
> documents are being added.
> By looking at the log (below), it seems that a delete by ID request is only
> executed during the next commit (done automatically every 1000 added
> documents), and that the process (Solrj or curl) executing the deleteById
> request is blocked until the commit is performed.
>
> Is it a normal behavior or a misconfiguration of our Solr server ?
>
> Thanks in advance for insights.
>
> [11:32:02.840]autowarming result for [EMAIL PROTECTED] main
> [11:32:02.840]
>  
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,c
> umulative_evictions=4289}
> [11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main
> [11:32:02.840]
>  
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
> [11:32:02.840]autowarming result for [EMAIL PROTECTED] main
> [11:32:02.840]
>  
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
> [11:32:02.840]Registered new searcher [EMAIL PROTECTED] main
> [11:32:02.840]{delete=[http://example.org/]} 0 14212
> [11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2}
> status=0 QTime=14212
> [11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids
> [11:32:02.840]Closing Writer DirectUpdateHandler2
> [11:32:02.842]Closing [EMAIL PROTECTED] main
> [11:32:02.842]
>  
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
> [11:32:02.842]
>  
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289}
> [11:32:02.842]
>  
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
> [11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2
> [11:32:03.566]DirectUpdateHandler2 docs deleted=0
> [11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2
>
> --
> Renaud Delbru
>


Re: Slow deleteById request

2008-07-01 Thread Yonik Seeley
On Tue, Jul 1, 2008 at 4:05 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
> We are not sending a commit with a delete. It happens when using the
> following command:
> curl http://mydomain.net:8080/index/update -s -H 'Content-type:text/xml;
> charset=utf-8' -d "<delete><id>http://example.org/</id></delete>"
> or using the SolrJ deleteById method (that does not execute a commit as far
> as I know).
>
> The strange things is that it is not always reproduced. Ten or so delete
> requests will be executed fast (in few ms), then a batch of few delete
> requests will take 10, 20 or even 30 seconds.
>
> By looking more precisely at the log, it seems that, in fact,  the delete
> request triggers the opening of a new searcher,

Yes,  perhaps you have maxPendingDeletes configured.

I'd try the latest nightly solr build... it now lets Lucene manage the deletes.

-Yonik
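
If it is configured, that setting lives in the update handler section of
solrconfig.xml, something like (the value is just an example):

  <updateHandler class="solr.DirectUpdateHandler2">
    <maxPendingDeletes>1000</maxPendingDeletes>
  </updateHandler>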


Re: MergeException

2008-07-02 Thread Yonik Seeley
Doug, it looks like it might be this Lucene bug:
https://issues.apache.org/jira/browse/LUCENE-1262

What version of Lucene is in the Solr you are running?  You might want
to try either one of the latest Solr nightly builds, or at least
upgrading your Lucene version in Solr if it's not the latest patch
release.

-Yonik

On Wed, Jul 2, 2008 at 9:03 AM, Doug Steigerwald
<[EMAIL PROTECTED]> wrote:
> What exactly does this error mean and how can we fix it?  As far as I can
> tell, all of our 30+ cores seem to be updating and autocommiting fine.  By
> fine I mean our autocommit hook is firing for all cores which leads me to
> believe that the commit is happening, but segments can't be merged.  Are we
> going to have to rebuild whatever core this happens to be (if I can figure
> it out)?
>
> Exception in thread "Thread-704"
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.IndexOutOfBoundsException: Index: 43, Size: 43
>at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 43, Size: 43
>at java.util.ArrayList.RangeCheck(Unknown Source)
>at java.util.ArrayList.get(Unknown Source)
>at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:154)
>at
> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
>at
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:319)
>at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:133)
>at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
>at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
>
> Thanks.
> Doug
>


Re: Fetching float or int fields from index by Lucene document

2008-07-02 Thread Yonik Seeley
See the methods on FieldType, esp toExternal()

-Yonik

On Wed, Jul 2, 2008 at 5:39 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote:
> As part of my results, I am building a lot of facet information. For example, 
> an Attribute ID also needs to return the Attribute Text.
>
> So, I have code like the following (really in a cache):
>
> Term term = new Term ("AtrID", "A0001");
> Document doc = searcher.doc(searcher.getFirstMatch(term));
>
> return doc.get("AtrText");
>
> This works great for string fields. But, if I am looking at a field that is 
> non-string in Document.get(), I get strange characters. I also notice if I 
> just print out doc.toString(). Most fields look fine, but int or float fields 
> are all messed up. I assume this is because the Lucene index really is just 
> text, so Solr must do some sort of encoding here.
>
> Is there anyway to decode the string into something readable?
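
A sketch of the decoding being suggested, using the schema's FieldType (field
name taken from the question; signatures as of the 1.3 trunk):

  import org.apache.lucene.document.Document;
  import org.apache.solr.schema.FieldType;
  import org.apache.solr.schema.IndexSchema;

  public class DecodeStoredField {
    // turn the internally encoded stored value back into a readable string
    public static String readable(IndexSchema schema, Document doc, String fieldName) {
      FieldType ft = schema.getFieldType(fieldName);
      return ft.toExternal(doc.getFieldable(fieldName));
    }
  }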


Re: Fetching float or int fields from index by Lucene document

2008-07-02 Thread Yonik Seeley
On Wed, Jul 2, 2008 at 10:45 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote:
> It never really occured to me before, but it is kind of weird that there
> is a toInternal and a toExternal and an indexedToReadable -- but there is
> no readableToIndexed ... toInternal is used both for the "indexed" value
> and for the "stored" value, so things like SortableIntField wind up
> "storing" an encoded value even though there isn't much need for it.

Legacy stuff in a way.  In the beginning there was only toInternal()
and toExternal().  I added indexedToReadable(), storedToReadable(),
storedToIndexed() later for more completeness.  I think I didn't add a
readableToIndexed() since that was really what toInternal() was.

Really, the stored field could be a normal readable value - no reason
it has to match the indexed value.
I made them the same as an optimization to bypass the analyzer since
there was no way to just give the token directly to the IndexWriter
(worth it?  I don't know...)

-Yonik


Re: Big slowdown with phrase queries

2008-07-03 Thread Yonik Seeley
On Thu, Jul 3, 2008 at 6:04 PM, Chris Harris <[EMAIL PROTECTED]> wrote:
> Now I gather that phrase queries are inherently slower than non-phrase
> queries, but 1-3 orders of magnitude difference seems noteworthy.

Phrase queries could be a couple times slower, but normally not to the
degree you show here.

The most likely factor is that phrase queries need to look at term
positions, and those are in a different part of the index that may not
be cached by the OS (esp if phrase queries are rare in your system).
You may not even have enough system RAM free to allow caching
positions also.

Check your index and look at the total size of the .tis files, the
.frq files, and the .prx files.
.tis and .frq is used to look up terms and what documents match those
terms.  .prx files are used for the term positions in each document.

You may also want to test things out in a more controlled manner (a
system with no live traffic, etc) to narrow things down some more.

> This is on Solr r654965, which I don't think is *too* far behind the
> trunk version. 1200Mb RAM allocated to Solr. 8M documents. Lots of
> compressed, stored fields. Most docs are probably like 50Kb, but some
> of them might be 10Mb, 100Mb. The index as a whole is 106GB.
> maxFieldLength=1. The index was recently optimized. (It has only
> one segment right now.)
>
> I'm thinking that even supposing I've indexed everything in a horrible
> inefficient manner, and even supposing my machine is woefully
> underpowered, that wouldn't really explain why the phrase queries
> would be *that* much slower, would it? Any ideas? Indexing with
> termPositions wouldn't help, would it?

No.  TermVectors are not used for phrase queries.

> (Now I'm not using
> termPositions or termVectors.) Or what if I used an alternative query
> parser, so phrase queries could be implemented in terms of the
> SpanNearQuery class rather than the PhraseQuery class?

Span queries would be slower than phrase queries.

-Yonik


Re: Big slowdown with phrase queries

2008-07-03 Thread Yonik Seeley
On Thu, Jul 3, 2008 at 7:05 PM, Chris Harris <[EMAIL PROTECTED]> wrote:
> Ok, I only have one segment right now, so I've got one of each of these:
>
> .tis file: 730MB
> .frq files: 9KB
> .prx file: 26KB

That's pretty much impossible (way too small).  Double check those numbers.

> If I'm understanding you (and Mike) properly, then even though it's
> the prx file that contains the actual position info, you can't get to
> that info quickly unless the tis file is also cached in RAM by the OS.

Right.

> I have to admit I don't know that much about OS disk caching. Can I
> more or less pretend that the OS uses a least recently used (LRU)
> algorithm?

Pretty much.
How much physical RAM do you have?

-Yonik


Re: Bulk delete

2008-07-04 Thread Yonik Seeley
On Fri, Jul 4, 2008 at 10:52 AM, Jonathan Ariel <[EMAIL PROTECTED]> wrote:
> Is there any good way to do a bulk delete of several documents?
> I have more than 1000 documents to delete... and I don't want to send N
> request with X.
> Doing a query delete isn't a good solution because I have a maximum amount
> of terms that I can use in the query. For example:
> id:(X1 OR X2 OR ... OR Xn) where n could be
> more than 1000

As of Solr 1.3, you can specify multiple ids

<delete><id>1</id><id>2</id></delete>

-Yonik


Re: Bulk delete

2008-07-04 Thread Yonik Seeley
On Fri, Jul 4, 2008 at 12:06 PM, Jonathan Ariel <[EMAIL PROTECTED]> wrote:
> Yes, I just wanted to avoid N requests and do just 2.

Note that you can do it in a single request if you really want... just
add ?commit=true to the URL.

-Yonik


Re: implementing a random result request handler - solr 1.2

2008-07-07 Thread Yonik Seeley
If it's just a random ordering you are looking for, it's implemented
in the latest Solr 1.3
Solr 1.3 should be out soon, so if you are just starting development,
I'd start with the latest Solr version.

If you really need to stick with 1.2 (even after 1.3 is out?)  then
RandomSortField should be easy to backport to 1.2

-Yonik

On Mon, Jul 7, 2008 at 1:15 PM, Sean Laval <[EMAIL PROTECTED]> wrote:
> Well its simply a business requirement from my perspective. I am not sure I
> can say more than that. I could maybe implement a request handler that did
> an initial search to work out how many hits there are resulting from the
> query and then did as many more queries as were required fetching just 1
> document starting at a given random number .. would that work? Sounds a bit
> cludgy to me even as I say it.
>
> Sean
>
>
>
> --
> From: "Walter Underwood" <[EMAIL PROTECTED]>
> Sent: Monday, July 07, 2008 5:06 PM
> To: 
> Subject: Re: implementing a random result request handler - solr 1.2
>
>> Why do you want random hits? If we know more about the bigger
>> problem, we can probably make better suggestions.
>>
>> Fundamentally, Lucene is designed to quickly return the best
>> hits for a query. Returning random hits from the entire
>> matched set is likely to be very slow. It just isn't what
>> Lucene is designed to do.
>>
>> wunder
>>
>> On 7/7/08 8:58 AM, "Sean Laval" <[EMAIL PROTECTED]> wrote:
>>
>>> I have seen various posts about implementing random sorting relating to
>>> the
>>> 1.3 code base but I am trying to do this in 1.2. Does anyone have any
>>> suggestions? The approach I have considered is to implement my own
>>> request
>>> handler that picks random documents from a larger result list. I
>>> therefore
>>> need to be able to create a DocList and add documents to it but can't
>>> seem to
>>> do this. Does anyone have any advice they could offer please?
>>>
>>> Regards,
>>>
>>> Sean
>>
>>
>


Re: implementing a random result request handler - solr 1.2

2008-07-07 Thread Yonik Seeley
On Mon, Jul 7, 2008 at 1:40 PM, Sean Laval <[EMAIL PROTECTED]> wrote:
> The RandomSortField in 1.3 each time you then issue a query, you get the
> same random sort order right? That is to say the randomness is implemented
> at index time rather than search time?

See the comment in the example schema:




-Yonik
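
The comment in question boils down to this: the example schema declares the
type and a dynamic field roughly as below (exact names may differ), and the
sort order is derived from a hash of the field name plus the index version.
Sorting on a fresh dynamic field name per request (for example sort=random_1234
asc on one query and sort=random_5678 asc on the next) therefore gives a
different ordering each time, rather than one fixed at index time.

  <fieldType name="random" class="solr.RandomSortField" indexed="true" />
  <dynamicField name="random*" type="random" />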


Re: schema.xml compatibility

2008-07-09 Thread Yonik Seeley
On Wed, Jul 9, 2008 at 7:13 PM, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
> I've noticed that schema.xml in the dev version of Solr spells
> what used to be fieldtype as fieldType with capital T.
>
> Are there any other compatibility issues between the would-be
> Solr 1.3 and Solr 1.2?

It shouldn't be a compatibility issue since both will be accepted.
The xpath used to select fieldType nodes is
"/schema/types/fieldtype | /schema/types/fieldType"

> How soon Solr 1.3 will be available, by the way?

Hopefully soon... perhaps the end of the month.

-Yonik


Re: Certain form of autocomplete (like Google Suggest)

2008-07-09 Thread Yonik Seeley
Would facet.prefix work for you?

-Yonik

On Fri, Jul 4, 2008 at 4:58 AM, Marian Steinbach <[EMAIL PROTECTED]> wrote:
> Hi all!
>
> I just startet evaluating Solr a few days ago and I'm quite happy with
> the way it works. The test project I am using on is a product search
> for a wine shop with 2500 articles and about 20 fields, with faceted
> search.
>
> Now I'd like to know what would be the best way to implement a search
> term autocompletion in the way of Google Suggest
> (http://www.google.com/webhp?complete=1&hl=en).
>
> Most autocomplete implementations aim to display search result entries
> during input. What Suggest does, and what I'd like to accomplish, is
> an automatic suggestion of relevant index terms. This would help users
> to prevent spelling problems, which are a huge issue in the domain of
> wine, where almost every other term is french.
>
> Szenario:
>
> 1) The user types "sa" into a query input field.
> 2) The system searches for the 10 most frequent index terms starting
> with "sa" and displays the result in a menu.
> 3) The user adds a 3rd character => input is now "sau"
> 4) The system searches for the 10 most frequent index terms starting
> with "sau" and displays the result in a menu.
> 5) The user clicks on "sauvignon" in the menu and the term in the
> input field is completed to "sauvignon".
>
> So, what I need technically is the (web service) query that delivers
> all index terms (for specific index fields) starting with a certain
> prefix. The result should be ordered by frequency and limited to a
> certain amount of entries.
>
> Is this functionality already available in the Solr core?
>
> It seems as if "Schema Browser" functionality of the luke webapp (part
> of the nightly build) does something similar, but I can't find out how
> to limit the term lists to match the requirements above.
>
> I have to mention that I'm not an experienced Java developer. :)
>
> Thanks for your help!
>
> Marian
>
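
A request along those lines, assuming the indexed terms live in a field called
"text" (adjust the field name; facet counts come back most-frequent-first by
default):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=text&facet.prefix=sau&facet.limit=10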


Re: Document rating/popularity and scoring

2008-07-11 Thread Yonik Seeley
See ExternalFileField and BoostedQuery

-Yonik

On Fri, Jul 11, 2008 at 11:47 AM, climbingrose <[EMAIL PROTECTED]> wrote:
> Hi all,
> Has anyone tried to factor rating/popularity into Solr scoring? For example,
> I want documents with more page views to be ranked higher in the search
> results. From what I can see, the most difficult thing is that we have to
> update the number of page views for each document. With Solr-139, document
> can be updated at field level. However, it still have to retrieve the
> document and then do a reindex. With high traffic sites, the overhead might
> be too high.
>
> I'm thinking of using relational database to track page views / ratings and
> then do a daily sync with Solr. Is there a way for Solr to retrieve data
> from external sources (database server) and use the data for determining
> document ranking?
>
> Thanks.
>
> --
> Regards,
>
> Cuong Hoang
>
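
A sketch of what that looks like (type, field and file names are examples):

  <!-- schema.xml -->
  <fieldType name="pageViewsType" class="solr.ExternalFileField"
             keyField="id" defVal="0" valType="float"/>
  <field name="pageViews" type="pageViewsType" indexed="false" stored="false"/>

The values come from a plain text file named external_pageViews in the index
data directory, one "id=value" line per document; it is re-read when a new
searcher is opened, so a daily sync can just rewrite the file and issue a
commit. The field can then be used as a function source, e.g. bf=pageViews in
a dismax handler.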


Re: Max Warming searchers error

2008-07-11 Thread Yonik Seeley
You're trying to commit too fast and warming searchers are stacking up.
Do less warming of caches, or space out your commits a little more.

-Yonik

On Fri, Jul 11, 2008 at 11:56 AM, sundar shankar
<[EMAIL PROTECTED]> wrote:
> Hi ,
>I am getting the "Error opening new searcher. exceeded limit of 
> maxWarmingSearchers=4, try again later." My configuration includes enabling 
> coldSearchers to true and Having number of maxWarmimgSearchers as 4. We 
> expect a max of 40 concurrent users but an average of 5-10 at most times. 
> Would this configuration not work?
>
> -Sundar
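
The warming in question is configured per cache in solrconfig.xml; turning
autowarmCount down (or to 0) is the usual first step (sizes below are just the
example defaults):

  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>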


Re: too many open files

2008-07-14 Thread Yonik Seeley
On Mon, Jul 14, 2008 at 5:14 AM, Alexey Shakov <[EMAIL PROTECTED]> wrote:
> now we have set the limit to ~10000 files
> but this is not the solution - the amount of open files increases
> permanently.
> Sooner or later, this limit will be exhausted.

How can you tell? Are you seeing descriptor use continually growing?

-Yonik


Re: too many open files

2008-07-14 Thread Yonik Seeley
Solr uses reference counting on IndexReaders to close them ASAP (since
relying on gc can lead to running out of file descriptors).

-Yonik

On Mon, Jul 14, 2008 at 9:15 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have a similar problem, not with Solr, but in Java. From what I have
> found, it is a usage and os problem: comes from using to many files, and
> the time it takes the os to reclaim the fds. I found the recomendation
> that System.gc() should be called periodically. It works for me. May not
> be the most elegant, but it works.
>
> Brian.
>
> Am Montag, den 14.07.2008, 11:14 +0200 schrieb Alexey Shakov:
>> now we have set the limit to ~10000 files
>> but this is not the solution - the amount of open files increases
>> permanently.
>> Sooner or later, this limit will be exhausted.
>>
>>
>> Fuad Efendi schrieb:
>> > Have you tried [ulimit -n 65536]? I don't think it relates to files
>> > marked for deletion...
>> > ==
>> > http://www.linkedin.com/in/liferay
>> >
>> >
>> >> Earlier or later, the system crashes with message "Too many open files"
>> >
>> >
>>
>>
>>
>
>


Re: too many open files

2008-07-14 Thread Yonik Seeley
On Mon, Jul 14, 2008 at 9:52 AM, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Even Oracle requires 65536; MySQL+MyISAM depends on number of tables,
> indexes, and Client Threads.
>
> From my experience with Lucene, 8192 is not enough; leave space for OS too.
>
> Multithreaded application (in most cases) multiplies number of files to a
> number of threads (each thread needs own handler), in case with SOLR-Tomcat:
> 256 threads... Number of files depends on mergeFactor=10 (default for SOLR).
> Now, if 10 is "merge factor" and we have *.cfs, *.fdt, etc (6 file types per
> segment):
> 256*10*6 = 15360 (theoretically)

In Solr, the number of threads does not come into play.  It would only
matter in Lucene if you were doing something like opening an
IndexReader per thread or something.

The number of files per segment is normally more like 12, but only 9
of them are held open for the entire life of the reader.

Also remember that the IndexWriter internally uses another IndexReader
to do deletions, and Solr can have 2 (or more) open... one serving
queries and one opening+warming.

-Yonik
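
Two practical knobs if descriptors do get tight, plus a quick way to watch real
usage (the pid and path below are placeholders):

  <!-- solrconfig.xml: compound segment files and the default merge factor -->
  <useCompoundFile>true</useCompoundFile>
  <mergeFactor>10</mergeFactor>

  # count descriptors the Solr JVM holds on the index directory
  lsof -p <solr-jvm-pid> | grep data/index | wc -l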


Re: - uniqueness not enforced?

2008-07-14 Thread Yonik Seeley
On Mon, Jul 14, 2008 at 10:01 AM, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> I was adding new documents with same Id in a hope that it will replace old
> one; instead, new document added with same Id. Is it bug?

It should work fine.
It might be a schema issue for you... try the solr example setup.

-Yonik


Re: too many open files

2008-07-14 Thread Yonik Seeley
On Mon, Jul 14, 2008 at 10:17 AM, Alexey Shakov <[EMAIL PROTECTED]> wrote:
> Yonik Seeley schrieb:
>>
>> On Mon, Jul 14, 2008 at 5:14 AM, Alexey Shakov <[EMAIL PROTECTED]>
>>> now we have set the limit to ~10000 files
>>> but this is not the solution - the amount of open files increases
>>> permanently.
>>> Sooner or later, this limit will be exhausted.
>>>
>>
>> How can you tell? Are you seeing descriptor use continually growing?
>>
>> -Yonik
>
> The 'deleted' index files listed with the lsof command today were also listed
> (as deleted) several days ago...
>
> But the number of these 'deleted' files keeps increasing. So I draw the
> conclusion that it is only a question of time until the limit of 10000 is
> reached.

You are probably just seeing growth in the number of segments in the
index... which means that any IndexReader will be holding open a
larger number of files at any one time (and become deleted when the
IndexWriter removes old segments).

This growth in the number of segments isn't unbounded though (because
of segment merges).  Your 10,000 descriptors should be sufficient.

-Yonik


> Fortunately, check-ins of new documents are infrequent. The server (Tomcat)
> was restarted (due to various software updates) relatively often in the last
> few weeks...  So we have not yet had the chance to reach this limit. But the
> default open file limit (1024) was already reached several times (before we
> increased it)…
>
> Thanks for your help !
> Alexey


Re: WordDelimiterFilter splits at non-ASCII chars

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher
<[EMAIL PROTECTED]> wrote:
> as I understand the WordDelimiterFilter should split on case changes, word
> delimiters and changes from character to digit, but it should not
> differentiate between ASCII and multibyte chars. It does however. The word
> "hälse" (german plural of "neck") gets split into "h", "ä" and "lse", which
> unfortunately renders this filter quite unusable for me. Am i missing
> something or is this a bug?
> I'm using solr 1.3 built from trunk.

Look for charset issues in communicating with Solr.  I just tried this
with the "text" field via Solr's analysis.jsp and it works fine.

-Yonik
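
When posting documents, make sure the container actually receives UTF-8, e.g.
with curl the charset has to be declared explicitly (file name is an example);
for query-time problems the usual suspect is the servlet container's URI
decoding (e.g. URIEncoding="UTF-8" on Tomcat's connector):

  curl 'http://localhost:8983/solr/update' \
       -H 'Content-type:text/xml; charset=utf-8' --data-binary @docs.xml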


Re: Filter by Type increases search results.

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 11:10 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> On Tue, 15 Jul 2008 18:07:43 +0530
> "Preetam Rao" <[EMAIL PROTECTED]> wrote:
>
>> When I say filter, I meant q=fish&fq=type:idea
>
> btw, this *seems* to only work for me with standard search handler. dismax 
> and fq: dont' seem to get along nicely... but maybe, it is just late and i'm 
> not testing it properly..

It should work the same... the only thing dismax does differently now
is change the type of the base query to "dismax".

-Yonik


Re: solr synonyms behaviour

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 2:27 PM, swarag <[EMAIL PROTECTED]> wrote:
> To my understanding, this means I am using synonyms at index time and NOT
> query time. And yet, I am still having these problems with synonyms.

Can you give a specific example?  Use debugQuery=true to see what the
resulting query is.
You can also use the admin analysis page to see what the output of the
index and query analyzers.

-Yonik


Re: FileBasedSpellChecker behavior?

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 4:19 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> agreed, but there is a problem in Solr, AIUI, with regards to when the
> readers are available and when inform() gets called.  The workaround is to
> have a warming query, I believe.

Right... see https://issues.apache.org/jira/browse/SOLR-593

-Yonik


Re: Multiple query fields in DisMax handler

2008-07-17 Thread Yonik Seeley
On Thu, Jul 17, 2008 at 8:11 AM, chris sleeman <[EMAIL PROTECTED]> wrote:
> What I actually meant was whether or not I could create a configuration for
> a dismax query parser and then refer to it in my filter query. I already
> have a standard request handler with a "dismax" deftype for my query field.
> I wanted to use another dismax parser for the fq param, on the lines of what
> Ryan and Erik had suggested. Just dont want to specify all the params for
> this dismax at query time.

To separate the configuration from the client, you could at least grab
the parameter values from config (you still would need to specify the
parameter names though).  For a filter, some of the params you would
simply want to zero out.

fq={!dismax qf=$filter_qf pf= bf=}CA
  and set filter_qf as a default in solrconfig.xml
  OR separating out the actual query value to make this easier to compose:
fq={!dismax qf=$filter_qf pf= bf= v=$q2}&q2=CA

That's currently the closest you can get out of the box I think.

-Yonik

> My actual query would then simply look like - "
> http://localhost:8983/solr/select?q=*:*&fq={!dismaxL}CA";, instead of
> specifying all the qf, pf, etc fields as part of the dismax syntax within
> the query.
>
> Regards,
> Chris
>
> On Thu, Jul 17, 2008 at 5:18 PM, Preetam Rao <[EMAIL PROTECTED]>
> wrote:
>
>> If I understand the question correctly, you can provide init params,
>> default
>> params and invariant params in the appropriate request handler section in
>> solrconfig.xml.
>> So you can create a standard request handler with name dismaxL, whose
>> defType is dismax and set all parameters in defaults section.
>>
>> 
>> Preetam
>>
>> On Thu, Jul 17, 2008 at 4:35 PM, chris sleeman <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Thanks a lot..this is, more or less, what i was looking for.
>> >
>> > However, is there a way to pre-configure the dismax query parser, with
>> > parameters like qf, pf, boost etc., in solr-config.xml, rather than doing
>> > so
>> > at query time. So my actual query would look like - <
>> > http://localhost:8983/solr/select?q=<
>> >
>> http://localhost:8983/solr/select?q=*:*&fq=%7B%21dismax%20qf=%22name%22%7Dipod&debugQuery=true
>> > >
>> > query&fq={!dismaxL}CA&debugQuery=true<
>> >
>> http://localhost:8983/solr/select?q=*:*&fq=%7B%21dismax%20qf=%22name%22%7Dipod&debugQuery=true
>> > >>,
>> > where dismaxL refers to a query parser defined in solrconfig, with all
>> the
>> > necessary parameters. The q parameter would then use the default dismax
>> > parser defined for the handler and fq would use dismaxL.
>> >
>> > Regards,
>> > Chris
>> >
>> > On Thu, Jul 17, 2008 at 5:15 AM, Erik Hatcher <
>> [EMAIL PROTECTED]>
>> > wrote:
>> >
>> > > On Jul 16, 2008, at 7:38 PM, Ryan McKinley wrote:
>> > >
>> > >> (assuming you are using 1.3-dev), you could use the dismax query
>> parser
>> > >> syntax for the fq param.  I think it is something like:
>> > >> fq=your query
>> > >>
>> > >
>> > > The latest committed syntax is:
>> > >
>> > >   {!dismax qf=""}your query
>> > >
>> > > For example, with the sample data: <
>> > >
>> >
>> http://localhost:8983/solr/select?q=*:*&fq={!dismax%20qf=%22name%22}ipod&debugQuery=true
>> <
>> http://localhost:8983/solr/select?q=*:*&fq=%7B%21dismax%20qf=%22name%22%7Dipod&debugQuery=true
>> >
>> > <
>> >
>> http://localhost:8983/solr/select?q=*:*&fq=%7B%21dismax%20qf=%22name%22%7Dipod&debugQuery=true
>> > >
>> > > >
>> > >
>> > >  I can't find the syntax now (Yonik?)
>> > >>
>> > >> but I don't know how you could pull out the qf,pf,etc fields for the
>> fq
>> > >> portion vs the q portion.
>> > >>
>> > >
>> > > You can add parameters like the qf above, within the {!dismax ... }
>> area.
>> > >
>> > >Erik
>> > >
>> > >
>> >
>> >
>> > --
>> > Bill Cosby  - "Advertising is the most fun you can have with your clothes
>> > on."
>> >
>>
>
>
>
> --
> Yogi Berra  - "A nickel ain't worth a dime anymore."
>
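
For completeness, the $filter_qf indirection above only needs the parameter
defined as a default on the handler, along these lines (field names made up):

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="filter_qf">state^2.0 city</str>
    </lst>
  </requestHandler>

The request itself then stays as short as fq={!dismax qf=$filter_qf pf= bf=}CA.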


Re: the design of the solr admin page

2008-07-20 Thread Yonik Seeley
That's like asking if Solr should be made faster or easier to use :-)
No logo/cosmetic changes have been made to the admin pages recently
because no one has taken the time to find consensus and submit a
patch.  Getting everyone to say "yes let's change things for the
better" won't change anything.

Perhaps a poll about what logo people prefer would be a more concrete step?

-Yonik

On Sun, Jul 20, 2008 at 2:56 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> http://myhardshadow.com/poll.html
>
> If you have a second, please answer this single question poll about the look
> of the solr admin pages. It would be nice to gauge what the community thinks
> of the current look. Take off of my question earlier to the solr dev list:
> http://www.nabble.com/solr-admin-look-td18556364.html
>
> Obviously a lack of votes will signal a lack of interest in the look of the
> solr admin pages, which is basically option 4.
>
> - Mark


Re: spellchecker problems (bugs)

2008-07-22 Thread Yonik Seeley
On Tue, Jul 22, 2008 at 11:07 AM, Geoffrey Young
<[EMAIL PROTECTED]> wrote:
> Shalin Shekhar Mangar wrote:
>>
>> The problems you described in the spellchecker are noted in
>> https://issues.apache.org/jira/browse/SOLR-622 -- I shall create an issue
>> to
>> synchronize spellcheck.build so that the index is not corrupted.
>
> I'd like to discuss this a little...
>
> I'm not sure that I want to rebuild the spelling index each time the
> underlying data index changes - the process takes very long and my updates
> are frequent changes to non-spelling related data.
>
> what I'd really like is for a change to my index to not cause an exception.
>  IIRC the "old" way of using a spellchecker didn't work like this at all - I
> could completely rm data/index and leave data/spell in place, add new data,
> not issue cmd=build and the spelling parts still worked just fine (albeit
> with old data).
>
> not to say that SOLR-622 isn't a good idea (it is) but I don't really think
> the entire solution is keeping the spellcheck index in sync.  do they need
> to be kept in sync for things not to implode on me?

Agree... spell check indexes should not have to be in sync, and
anything to keep them in sync automatically should be optional (and
probably disabled by default).

-Yonik


Re: Query for an exact match

2008-07-22 Thread Yonik Seeley
On Tue, Jul 22, 2008 at 11:08 AM, Ian Connor <[EMAIL PROTECTED]> wrote:
> How can I require an exact field match in a query. For instance, if a
> title field contains "Nature" or "Nature Cell Biology", when I search
> title:Nature I only want "Nature" and not "Nature Cell Biology". Is
> that something I do as a query or do I need to re index it with the
> field defined in a certain way?
>
> I have this definition now - but it returns all titles that contain
> "Nature" rather than just the ones that equal it exactly.
>
> <field name="title" type="string" indexed="true" stored="true"
>  omitNorms="true"/>

That field definition should do it.
Try title:Nature  it may be that you have a default search field that
has a different analyzer configured.
If that doesn't work, make sure you have reindexed after your schema changes.

-Yonik


Re: Query for an exact match

2008-07-22 Thread Yonik Seeley
On Tue, Jul 22, 2008 at 11:39 AM, Ian Connor <[EMAIL PROTECTED]> wrote:
> <field name="title" type="string" indexed="true" stored="true" omitNorms="true"/>

This will give you an exact match.  As I said, if it's not, then you
didn't restart and reindex, or you are querying the wrong field.

-Yonik


Re: maximum length of string that Solr can index

2008-07-22 Thread Yonik Seeley
Lucene has a maxFieldLength (the number of tokens to index for a given
field name).
It can be configured via solrconfig.xml:
<maxFieldLength>10000</maxFieldLength>

-Yonik

On Tue, Jul 22, 2008 at 11:38 AM, Tom Lord <[EMAIL PROTECTED]> wrote:
> Hi, we've looked for info about this issue online and in the code and am
> none the wiser - help would be much appreciated.
>
> We are indexing the full text of journals using Solr. We currently pass
> in the journal text, up to maybe 130 pages, and index it in one go.
>
> We are seeing Solr stop indexing after ~30 pages or so. That is, when we
> look at the indexed text field using Luke, we can see where it gives up
> collecting information from the text.
>
> What is the maximum size that we can index on? Is this a known issue or
> standard behaviour, or is something else amiss?
>
> If this is standard behaviour, what is the approved way of avoiding this
> issue? Should we index on a per-page basis rather than trying to do 130
> pages as a single document?
>
> thanks in advance,
> Tom.
>
> --
> Tom Lord | ([EMAIL PROTECTED])
>
> Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
> The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES
>
> Aptivate is a not-for-profit company registered in England and Wales
> with company number 04980791.
>
>


Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Yonik Seeley
That high of a difference is due to the part of the index containing
these particular stored fields not being in OS cache.  What's the size
on disk of your index compared to your physical RAM?

-Yonik

On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> For some queries I need to return a lot of rows at once (say 100).
> When performing these queries I notice a big difference between qTime (which
> is mostly in the 15-30 ms range due to caching) and total time taken to
> return the response (measured through SolrJ's elapsedTime), which takes
> between 500-1600 ms.
>
> For queries which return less rows the difference becomes less big.
>
> I presume (after reading some threads in the past) that this is due to solr
> constructing and streaming the response (which includes retrieving the
> stored fields) , which is something that is not calculated in qTime.
>
> Documents have a lot of stored fields (more than 10.000), but at any given
> query a maximum of say 20 are returned (through fl-field ) or used (as part
> of filtering, faceting, sorting)
>
> I would have thought that enabling enableLazyFieldLoading for this situation
> would mean a lot, since so many stored fields can be skipped, but I notice
> no real difference in measuring total elapsed time (or qTime for that
> matter).
>
> Am I missing something here? What criteria would need to be met for a field
> to not be loaded for instance? Should I see a big performance boost in this
> situation?
>
> Thanks,
> Britske
> --
> View this message in context: 
> http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18698590.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Yonik Seeley
That's a bit too tight to have *all* of the index cached...your best
bet is to go to 4GB+, or figure out a way not to have to retrieve so
many stored fields.

-Yonik

On Mon, Jul 28, 2008 at 4:27 PM, Britske <[EMAIL PROTECTED]> wrote:
>
> Size on disk is 1.84 GB (of which 1.3 GB sits in FDT files if that matters)
> Physical RAM is 2 GB with -Xmx800M set to Solr.
>
>
> Yonik Seeley wrote:
>>
>> That high of a difference is due to the part of the index containing
>> these particular stored fields not being in OS cache.  What's the size
>> on disk of your index compared to your physical RAM?
>>
>> -Yonik
>>
>> On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi all,
>>>
>>> For some queries I need to return a lot of rows at once (say 100).
>>> When performing these queries I notice a big difference between qTime
>>> (which
>>> is mostly in the 15-30 ms range due to caching) and total time taken to
>>> return the response (measured through SolrJ's elapsedTime), which takes
>>> between 500-1600 ms.
>>>
>>> For queries which return less rows the difference becomes less big.
>>>
>>> I presume (after reading some threads in the past) that this is due to
>>> solr
>>> constructing and streaming the response (which includes retrieving the
>>> stored fields) , which is something that is not calculated in qTime.
>>>
>>> Documents have a lot of stored fields (more than 10.000), but at any
>>> given
>>> query a maximum of say 20 are returned (through fl-field ) or used (as
>>> part
>>> of filtering, faceting, sorting)
>>>
>>> I would have thought that enabling enableLazyFieldLoading for this
>>> situation
>>> would mean a lot, since so many stored fields can be skipped, but I
>>> notice
>>> no real difference in measuring total elapsed time (or qTime for that
>>> matter).
>>>
>>> Am I missing something here? What criteria would need to be met for a
>>> field
>>> to not be loaded for instance? Should I see a big performance boost in
>>> this
>>> situation?
>>>
>>> Thanks,
>>> Britske
>>> --
>>> View this message in context:
>>> http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18698590.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18698909.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Yonik Seeley
On Mon, Jul 28, 2008 at 4:53 PM, Britske <[EMAIL PROTECTED]> wrote:
> Each query requests at most 20 stored fields. Why doesn't help
> lazyfieldloading in this situation?

It's the disk seek that kills you... loading 1 byte or 1000 bytes per
document would be about the same speed.

> Also, if I understand correctly, for optimal performance I need to have at
> least enough RAM to put the entire Index size in OS cache (thus RAM) + the
> amount of RAM that SOLR / Lucene consumes directly through the JVM?

The normal usage is to just retrieve the stored fields for the top 10
(or a window of 10 or 20) documents.  Under this scenario, the
slowdown from not having all of the stored fields cached is usually
acceptable.  Faster disks (seek time) can also help.

> Luckily most of the normal queries return 10 documents each, which results
> in a discrepancy between total elapsed time and qTIme of about 15-30 ms.
> Doesn't this seem strange, since to me it would seem logical that the
> discrepancy would be at least 1/10th of fetching 100 documents.

Yes, in general 1/10th the cost is what one would expect on average.
But some of the docs you are trying to retrieve *will* be in cache, so
it's hard to control this test.
You could try forcing the index out of memory by "cat"ing some other
big files multiple times and then re-trying or do a reboot to be
sure.

-Yonik

