Re: Index size vs. number of documents

2008-08-15 Thread Chris Hostetter

:  I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in stored fields is significantly greater than the indexed 
fields, the index size almost never grows linearly with the number of 
documents -- it's the number of unique terms that tends to primarily 
influence the size of the index.

At some point someone on the java-user list who really understood the file 
formats wrote a really great formula for estimating the size of the index 
assuming some ratios of unique terms per doc, but I can't find it now.


-Hoss



Can I copy an index built on a Windows system to a Unix/Linux system?

2008-08-15 Thread johnwarde

Hi,

Can I copy an index built on a Windows system to a Unix/Linux system and
still have it work?

Reason for my question:
I have been working with Solr for the last month on a Windows system, and I
have determined that we need a replication solution for our future
needs (volume of documents to be indexed and query loads).

From my research, it looks like Solr does not currently provide a
reliable/tested replication strategy on Windows.

However, I would like to continue to use Solr on Windows for now, until the
load on the single Windows system becomes too great and requires us to
implement a replication strategy (one index master, many query slaves).
Hopefully, by that time a reliable replication strategy on Windows may
present itself, but if it doesn't ...

Can I make a binary copy of the index files from a Windows system to a
Unix/Linux system and have it read by Solr on the Unix/Linux system? Would
there be any byte-order problems? Or would I need to rebuild the index from
the original data?

Many thanks for your help!

John






Re: IndexOutOfBoundsException

2008-08-15 Thread Doug Steigerwald
We actually have this same exact issue on 5 of our cores.  We're just  
going to wipe the index and reindex soon, but it isn't actually  
causing any problems for us.  We can update the index just fine,  
there's just no merging going on.


Ours happened when I reloaded all of our cores for a schema change.  I  
don't do that any more ;).


Doug

On Aug 14, 2008, at 11:08 PM, Yonik Seeley wrote:


Since this looks like more of a lucene issue, I've replied in
[EMAIL PROTECTED]

-Yonik

On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor [EMAIL PROTECTED]  
wrote:

I seem to be able to reproduce this very easily and the data is
medline (so I am sure I can share it if needed with a quick email to
check).

- I am using fedora:
%uname -a
Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
%java -version
java version 1.7.0
IcedTea Runtime Environment (build 1.7.0-b21)
IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
- single core (will use shards, but each machine has just one HDD so I
didn't see how cores would help -- I am new at this)
- next run I will keep the output to check for earlier errors
- very reproducible, and I can share code + data if that will help

On Thu, Aug 14, 2008 at 4:23 PM, Yonik Seeley [EMAIL PROTECTED]  
wrote:

Yikes... not good.  This shouldn't be due to anything you did wrong
Ian... it looks like a lucene bug.

Some questions:
- what platform are you running on, and what JVM?
- are you using multicore? (I fixed some index locking bugs  
recently)

- are there any exceptions in the log before this?
- how reproducible is this?

-Yonik

On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor [EMAIL PROTECTED]  
wrote:

Hi,

I have rebuilt my index a few times (it should get up to about 4
Million but around 1 Million it starts to fall apart).

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
   at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
   at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
   at java.util.ArrayList.rangeCheck(ArrayList.java:572)
   at java.util.ArrayList.get(ArrayList.java:350)
   at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
   at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
   at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
   at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
   at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)



When this happens, the disk usage goes right up and the indexing
really starts to slow down. I am using a Solr build from about a week
ago - so my Lucene is at 2.4 according to the war files.

Has anyone seen this error before? Is it possible to tell which array
is too large? Would it be an array I am sending in, or another internal
one?

Regards,
Ian Connor







--
Regards,

Ian Connor





Re: IndexOutOfBoundsException

2008-08-15 Thread Ian Connor
I tried it again (rm -rf /solr/index and posted all the docs again),
but this time I get this error (I also switched to the Sun JVM to see
if that helped):

15-Aug-08 4:57:08 PM org.apache.solr.core.SolrCore execute
INFO: webapp=/solr path=/update params={} status=500 QTime=4576
15-Aug-08 4:57:08 PM org.apache.solr.common.SolrException log
SEVERE: javax.xml.stream.XMLStreamException: required string: field
   at gnu.xml.stream.XMLParser.error(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.require(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.readEndElement(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.next(libgcj.so.8rh)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:323)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:197)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:125)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1143)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
   at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

2008-08-15 16:57:08.440::WARN:  EXCEPTION
java.lang.NullPointerException
   at org.mortbay.io.bio.SocketEndPoint.getRemoteAddr(SocketEndPoint.java:116)
   at org.mortbay.jetty.Request.getRemoteAddr(Request.java:746)
   at org.mortbay.jetty.NCSARequestLog.log(NCSARequestLog.java:230)
   at 
org.mortbay.jetty.handler.RequestLogHandler.handle(RequestLogHandler.java:51)
   at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


On Fri, Aug 15, 2008 at 8:26 AM, Doug Steigerwald
[EMAIL PROTECTED] wrote:
 We actually have this same exact issue on 5 of our cores.  We're just going
 to wipe the index and reindex soon, but it isn't actually causing any
 problems for us.  We can update the index just fine, there's just no merging
 going on.

 Ours happened when I reloaded all of our cores for a schema change.  I don't
 do that any more ;).

 Doug

 On Aug 14, 2008, at 11:08 PM, Yonik Seeley wrote:

 Since this looks like more of a lucene issue, I've replied in
 [EMAIL PROTECTED]

 -Yonik

 On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor [EMAIL PROTECTED] wrote:

 I seem to be able to reproduce this very easily and the data is
 medline (so I am sure I can share it if needed with a quick email to
 check).

 - I am using fedora:
 %uname -a
 Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
 %java -version
 java version 1.7.0
 IcedTea Runtime Environment (build 1.7.0-b21)
 IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
 - single core (will use shards, but each machine has just one HDD so I
 didn't see how cores would help -- I am new at this)
 - next run I will keep the output to check 

Re: IndexOutOfBoundsException

2008-08-15 Thread Ian Connor
Ignore that error - I think I installed the Sun JVM incorrectly - this
seems unrelated to the error.

On Fri, Aug 15, 2008 at 9:01 AM, Ian Connor [EMAIL PROTECTED] wrote:
 I tried it again (rm -rf /solr/index and posted all the docs again),
 but this time I get this error (I also switched to the Sun JVM to see
 if that helped):

 15-Aug-08 4:57:08 PM org.apache.solr.core.SolrCore execute
 INFO: webapp=/solr path=/update params={} status=500 QTime=4576
 15-Aug-08 4:57:08 PM org.apache.solr.common.SolrException log
 SEVERE: javax.xml.stream.XMLStreamException: required string: field
   at gnu.xml.stream.XMLParser.error(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.require(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.readEndElement(libgcj.so.8rh)
   at gnu.xml.stream.XMLParser.next(libgcj.so.8rh)
   at 
 org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:323)
   at 
 org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:197)
   at 
 org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:125)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1143)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

 2008-08-15 16:57:08.440::WARN:  EXCEPTION
 java.lang.NullPointerException
   at org.mortbay.io.bio.SocketEndPoint.getRemoteAddr(SocketEndPoint.java:116)
   at org.mortbay.jetty.Request.getRemoteAddr(Request.java:746)
   at org.mortbay.jetty.NCSARequestLog.log(NCSARequestLog.java:230)
   at 
 org.mortbay.jetty.handler.RequestLogHandler.handle(RequestLogHandler.java:51)
   at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


 On Fri, Aug 15, 2008 at 8:26 AM, Doug Steigerwald
 [EMAIL PROTECTED] wrote:
 We actually have this same exact issue on 5 of our cores.  We're just going
 to wipe the index and reindex soon, but it isn't actually causing any
 problems for us.  We can update the index just fine, there's just no merging
 going on.

 Ours happened when I reloaded all of our cores for a schema change.  I don't
 do that any more ;).

 Doug

 On Aug 14, 2008, at 11:08 PM, Yonik Seeley wrote:

 Since this looks like more of a lucene issue, I've replied in
 [EMAIL PROTECTED]

 -Yonik

 On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor [EMAIL PROTECTED] wrote:

 I seem to be able to reproduce this very easily and the data is
 medline (so I am sure I can share it if needed with a quick email to
 check).

 - I am using fedora:
 %uname -a
 Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
 %java -version
 java version 1.7.0
 IcedTea Runtime Environment (build 1.7.0-b21)
 IcedTea 64-Bit Server 

Re: Can I copy an index built on a Windows system to a Unix/Linux system?

2008-08-15 Thread Erick Erickson
I've done exactly this many times in straight Lucene. Since Solr is built
on Lucene, I wouldn't anticipate any problems.

Make sure your transfer is binary mode...
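
For instance, a minimal sketch of a binary-safe copy (the host and paths
here are made up, and you'd want to commit and stop writers first so the
index files are quiescent):

# rsync transfers bytes verbatim -- no newline or charset mangling
rsync -av /c/solr/data/index/ solr@linuxhost:/var/solr/data/index/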

Best
Erick

On Fri, Aug 15, 2008 at 8:02 AM, johnwarde [EMAIL PROTECTED] wrote:


 Hi,

 Can I copy an index built on a Windows system to a Unix/Linux system and
 still have it work?

 Reason for my question:
 I have been working with Solr for the last month on a Windows system, and I
 have determined that we need a replication solution for our future
 needs (volume of documents to be indexed and query loads).

 From my research, it looks like Solr does not currently provide a
 reliable/tested replication strategy on Windows.

 However, I would like to continue to use Solr on Windows for now, until the
 load on the single Windows system becomes too great and requires us to
 implement a replication strategy (one index master, many query slaves).
 Hopefully, by that time a reliable replication strategy on Windows may
 present itself, but if it doesn't ...

 Can I make a binary copy of the index files from a Windows system to a
 Unix/Linux system and have it read by Solr on the Unix/Linux system? Would
 there be any byte-order problems? Or would I need to rebuild the index from
 the original data?

 Many thanks for your help!

 John







Re: Can I copy an index built on a Windows system to a Unix/Linux system?

2008-08-15 Thread johnwarde

Excellent! Many thanks for your help, Erick!

John


Erick Erickson wrote:
 
 I've done exactly this many times in straight Lucene. Since Solr is built
 on Lucene, I wouldn't anticipate any problems.
 
 Make sure your transfer is binary mode...
 
 Best
 Erick
 
 On Fri, Aug 15, 2008 at 8:02 AM, johnwarde [EMAIL PROTECTED] wrote:
 

 Hi,

 Can I copy an index built on a Windows system to a Unix/Linux system and
 still have it work?

 Reason for my question:
 I have been working with Solr for the last month on a Windows system, and I
 have determined that we need a replication solution for our future
 needs (volume of documents to be indexed and query loads).

 From my research, it looks like Solr does not currently provide a
 reliable/tested replication strategy on Windows.

 However, I would like to continue to use Solr on Windows for now, until the
 load on the single Windows system becomes too great and requires us to
 implement a replication strategy (one index master, many query slaves).
 Hopefully, by that time a reliable replication strategy on Windows may
 present itself, but if it doesn't ...

 Can I make a binary copy of the index files from a Windows system to a
 Unix/Linux system and have it read by Solr on the Unix/Linux system? Would
 there be any byte-order problems? Or would I need to rebuild the index from
 the original data?

 Many thanks for your help!

 John





 
 




Re: Indexing Only Parts of HTML Pages

2008-08-15 Thread Otis Gospodnetic
Hi Nick,

Yes, it sounds like you need either custom Nutch parsing code or a custom 
HTML parser that applies the logic you described and feeds Solr docs 
constructed from it.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Nick Tkach [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, August 13, 2008 12:44:58 PM
 Subject: Indexing Only Parts of HTML Pages
 
 I'm wondering, is there some way (out of the box) to tell Solr that 
 we're only interested in indexing certain parts of a page?  For example, 
 let's say I have a bunch of pages in my site that contain some common 
 navigation elements, roughly like this:
 
 <html>
   <body>
     <div>
       Stuff here about parts of my site
     </div>
     <div>
       More stuff about other parts of the site
     </div>
     A bunch of stuff particular to each individual page...
   </body>
 </html>
 Is there some way to either tell Solr not to index what's in the two 
 divs whenever it encounters them (and it will, in nearly every page) or, 
 failing that, to somehow easily give content in those areas a large 
 negative score in order to get the same effect?
 
 FWIW, we are using Nutch to do the crawling, but as I understand it 
 there's no way to get Nutch to skip only parts of pages without writing 
 custom code, right?
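
One rough way to implement the skip-these-divs logic outside of Solr is to
strip the blocks in a pre-processing pass before feeding pages to the
indexer. A sketch (the div ids are invented, and it assumes each open and
close tag sits on its own line with no nested divs):

# delete both nav blocks, then hand the stripped page to the indexer
sed -e '/<div id="leftnav">/,/<\/div>/d' \
    -e '/<div id="subnav">/,/<\/div>/d' page.html > page-stripped.html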



Re: Index size vs. number of documents

2008-08-15 Thread Phillip Farber

By "index size almost never grows linearly with the number of
documents," are you saying it increases more slowly than the number of 
documents, i.e. sub-linearly, or more rapidly?


With dirty OCR, the number of unique terms is always increasing due to 
the garbage words.


-Phil

Chris Hostetter wrote:

:  I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in stored fields is significantly greater than the indexed 
fields, the index size almost never grows linearly with the number of 
documents -- it's the number of unique terms that tends to primarily 
influence the size of the index.

At some point someone on the java-user list who really understood the file 
formats wrote a really great formula for estimating the size of the index 
assuming some ratios of unique terms per doc, but I can't find it now.



-Hoss



Re: Shard searching clarifications

2008-08-15 Thread Yonik Seeley
On Fri, Aug 15, 2008 at 12:34 PM, Phillip Farber [EMAIL PROTECTED] wrote:
 If I have 2 solr instances (solr1 and solr2), each serving a shard,
 is it correct that I only need to send my query to one of the shards, e.g.

 solr1:8080/select?shards=solr1,solr2 ...

 and that I'll get merged results over both shards returned to me by solr1?

Yes.

 The other question is: can I query each instance in non-shard mode, i.e.
 just as

 solr1:8080/select? ... or solr2:8080/select? ...

 if I'm only interested in the documents in one of the shards?

Yes.

-Yonik
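
Concretely, the two query styles above look something like this (host
names, core paths, and the query itself are illustrative):

# distributed: hit either instance; it fans out to both shards and merges
curl 'http://solr1:8080/solr/select?q=foo&shards=solr1:8080/solr,solr2:8080/solr'

# single-shard: omit the shards param to search only that instance's index
curl 'http://solr2:8080/solr/select?q=foo'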


Re: Index size vs. number of documents

2008-08-15 Thread Otis Gospodnetic
Here's an example.
Consider 2 docs with terms:

doc1: term1, term2, term3
doc2: term4, term5, term6

vs.

doc1: term1, term2, term3
doc2: term1, term1, term6

All other things constant, the former will make the index grow faster because 
it has more unique terms. Even if your OCR has garbage that makes noise in the 
form of new unique terms, there will still be some overlap (like that term1 in 
the second case above).
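
A quick way to eyeball this on your own data: count the unique
whitespace-delimited tokens in a sample of the raw text. This is far cruder
than any real analyzer, but it shows the trend:

# unique "terms" in a sample file, whitespace-tokenized
tr -s '[:space:]' '\n' < sample-docs.txt | sort -u | wc -l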

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Phillip Farber [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Friday, August 15, 2008 12:22:30 PM
 Subject: Re: Index size vs. number of documents
 
 By "index size almost never grows linearly with the number of
 documents," are you saying it increases more slowly than the number of 
 documents, i.e. sub-linearly, or more rapidly?
 
 With dirty OCR, the number of unique terms is always increasing due to 
 the garbage words.
 
 -Phil
 
 Chris Hostetter wrote:
  :  I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
  
  Unless the data in stored fields is significantly greater than the indexed 
  fields, the index size almost never grows linearly with the number of 
  documents -- it's the number of unique terms that tends to primarily 
  influence the size of the index.
  
  At some point someone on the java-user list who really understood the file 
  formats wrote a really great formula for estimating the size of the index 
  assuming some ratios of unique terms per doc, but I can't find it now.
  
  
  -Hoss
  



Re: Highlighting returns incorrect text on some results?

2008-08-15 Thread pdovyda2

Thanks Otis. I downloaded the nightly today and reindexed; it seems it was
a bug that you've worked out since 1.2, as I don't see the issue anymore.

Paul


Otis Gospodnetic wrote:
 
 Paul, we had many highlighter-related changes since 1.2, so I suggest you
 try the nightly.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: pdovyda2 [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Thursday, August 14, 2008 2:56:42 PM
 Subject: Highlighting returns incorrect text on some results?
 
 
 This is kind of a strange issue, but when I submit a query and ask for
 highlighting back, sometimes the highlighted text includes a question
 mark
 at the beginning, although a question mark character does not appear in
 the
 field that the highlighted text is taken from.
 
 I've put some sample XML output on the web at
 http://ucair.cs.uiuc.edu/pdovyda2/problem.xml
 If you look at the first and third highlights, you'll see what I'm
 talking
 about.  
 
 Besides looking a bit odd, it is causing my application to break because
 the
 highlighted field is multivalued, and I was doing text matching to
 determine
 which of the values was chosen for highlighting.
 
 Is this actually a bug, or have I just misconfigured something?  By the
 way,
 I am using the 1.2 release, I have not yet tried out a nightly build to
 see
 if this is an old problem.
 
 Thanks,
 Paul
 
 
 




partialResults, distributed search SOLR-502

2008-08-15 Thread Brian Whitman

I was going to file a ticket like this:

A SOLR-303 query with shards=host1,host2,host3 when host3 is down  
returns an error. One of the advantages of a shard implementation is  
that data can be stored redundantly across different shards, either as  
direct copies (e.g. when host1 and host3 are snapshooter'd copies of  
each other) or where there is some data RAID that stripes indexes  
for redundancy.


But then I saw SOLR-502, which appears to be committed.

If I have the above scenario (host1, host2, host3 where host3 is not up)
and set a timeAllowed, will I still get a 400 or will it come back
with partial results? If not, can we think of a way to get this to
work? It's my understanding already that duplicate docIDs are merged
in the SOLR-303 response, so other than building in some "this host
isn't working, just move on and report it" logic and, of course, the
work to index redundantly, we wouldn't need anything else to achieve a
good redundant shard implementation.


B




Auto commit error and java.io.FileNotFoundException

2008-08-15 Thread Chris Harris
I have an index (different from the ones mentioned yesterday) that was
working fine with 3M docs or so, but when I added a bunch more docs,
bringing it closer to 4M docs, the index seemed to get corrupted. In
particular, now when I start Solr up, or when my indexing process
tries to add a document, I get a complaint about missing index files.

The error on startup looks like this:

<record>
  <date>2008-08-15T10:18:54</date>
  <millis>1218820734592</millis>
  <sequence>92</sequence>
  <logger>org.apache.solr.core.MultiCore</logger>
  <level>SEVERE</level>
  <class>org.apache.solr.common.SolrException</class>
  <method>log</method>
  <thread>10</thread>
  <message>java.lang.RuntimeException: java.io.FileNotFoundException:
/ssd/solr-/solr/exhibitcore/data/index/_p7.fdt (No such file or
directory)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:733)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:387)
at org.apache.solr.core.MultiCore.create(MultiCore.java:255)
at org.apache.solr.core.MultiCore.load(MultiCore.java:139)
at org.apache.solr.servlet.SolrDispatchFilter.initMultiCore(SolrDispatchFilter.java:147)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:75)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.io.FileNotFoundException:
/ssd/solr-/solr/exhibitcore/data/index/_p7.fdt (No such file or
directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:506)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:536)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:445)
at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:75)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:308)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:197)
at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:55)
at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:75)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:93)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:724)
... 29 more
</message>
</record>

And the error on doc add looks like this:

<record>
  <date>2008-08-15T09:51:30</date>
  <millis>1218819090142</millis>
  <sequence>6571937</sequence>
  <logger>org.apache.solr.core.SolrCore</logger>
  <level>SEVERE</level>
  <class>org.apache.solr.common.SolrException</class>
  

Re: Can I copy an index built on a Windows system to a Unix/Linux system?

2008-08-15 Thread Noble Paul നോബിള്‍ नोब्ळ्
There is a feature (SOLR-561) being built to provide replication on
any platform. The patch works and it is tested. Do not expect it to
work with the current trunk, because a lot has changed in trunk since
the last patch. We will update it soon, once the dust settles down.
-

On Fri, Aug 15, 2008 at 7:45 PM, johnwarde [EMAIL PROTECTED] wrote:

 Excellent! Many thanks for your help, Erick!

 John


 Erick Erickson wrote:

 I've done exactly this many times in straight Lucene. Since Solr is built
 on Lucene, I wouldn't anticipate any problems.

 Make sure your transfer is binary mode...

 Best
 Erick

 On Fri, Aug 15, 2008 at 8:02 AM, johnwarde [EMAIL PROTECTED] wrote:


 Hi,

 Can I copy an index built on a Windows system to a Unix/Linux system and
 still have it work?

 Reason for my question:
 I have been working with Solr for the last month on a Windows system, and I
 have determined that we need a replication solution for our future
 needs (volume of documents to be indexed and query loads).

 From my research, it looks like Solr does not currently provide a
 reliable/tested replication strategy on Windows.

 However, I would like to continue to use Solr on Windows for now, until the
 load on the single Windows system becomes too great and requires us to
 implement a replication strategy (one index master, many query slaves).
 Hopefully, by that time a reliable replication strategy on Windows may
 present itself, but if it doesn't ...

 Can I make a binary copy of the index files from a Windows system to a
 Unix/Linux system and have it read by Solr on the Unix/Linux system? Would
 there be any byte-order problems? Or would I need to rebuild the index from
 the original data?

 Many thanks for your help!

 John













-- 
--Noble Paul


Re: Auto commit error and java.io.FileNotFoundException

2008-08-15 Thread Chris Harris
I've done some more sniffing on the Lucene list, and noticed that Otis
made the following comment about a FileNotFoundException problem in
late 2005:

"Are you using Windows and a compound index format (look at your index
dir - does it have .cfs file(s))? This may be a bad combination,
judging from people who reported this problem so far."

(http://www.nabble.com/fnm-file-disappear-td1531775.html#a1531775)

Again, a CFS index was indeed involved in my case, but my experience
comes almost three years after Otis' message...

On Fri, Aug 15, 2008 at 10:35 AM, Chris Harris [EMAIL PROTECTED] wrote:

 The following may or may not be relevant: I built the base 3M-ish doc
 index on a Windows machine, and it's a compound (.cfs) format index.
 (I actually created it not with Solr, but by using the index merging
 tool that comes with Lucene in order to merge three different
 non-compound format indexes that I'd previously made with Solr into a
 single index.) Before I started adding documents, I moved the index to
 a Linux machine running a newer version of Solr/Lucene than was on the
 Windows machine. The stuff described above all happened on Linux.

 Any thoughts?

 Thanks a bunch,
 Chris
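
One way to see exactly which segment files a damaged index is missing or
inconsistent is Lucene's CheckIndex tool, which only reads the index unless
told otherwise. A sketch (the jar name depends on your build):

# read-only diagnostic pass over the index directory
java -cp lucene-core-2.4-dev.jar org.apache.lucene.index.CheckIndex /path/to/index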



Re: Administrative questions

2008-08-15 Thread Jon Drukman

Jason Rennie wrote:

On Wed, Aug 13, 2008 at 1:52 PM, Jon Drukman [EMAIL PROTECTED] wrote:


Duh.  I should have thought of that.  I'm a big fan of djbdns so I'm quite
familiar with daemontools.

Thanks!



:)  My pleasure.  Was nice to hear recently that DJB is moving toward more
flexible licensing terms.  For anyone unfamiliar w/ daemontools, here's
DJB's explanation of why they rock compared to inittab, ttys, init.d, and
rc.local:

http://cr.yp.to/daemontools/faq/create.html#why


in case anybody wants to know, here's how to run solr under daemontools.

1. install daemontools
2. create /etc/solr
3. create a user and group called solr
4. create shell script /etc/solr/run (edit to taste; I'm using the 
default Jetty that comes with Solr):


#!/bin/sh
# send stderr to stdout so multilog captures everything
exec 2>&1
cd /usr/local/apache-solr-1.2.0/example
# drop privileges to the solr user and hand this shell over to the JVM
exec setuidgid solr java -jar start.jar


5. create /etc/solr/log/run containing:

#!/bin/sh
# timestamped, auto-rotated logs under ./main
exec setuidgid solr multilog t ./main

6. ln -s /etc/solr /service/solr

That is all. As long as you've got svscan set to launch when the system 
boots, Solr will run and auto-restart on crashes. Logs will be in 
/service/solr/log/main (auto-rotated).
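
Day-to-day control then uses the standard daemontools commands, for example:

svstat /service/solr   # show up/down state and uptime
svc -t /service/solr   # send TERM; supervise restarts solr
svc -d /service/solr   # take solr down
svc -u /service/solr   # bring it back up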


yay.
-jsd-



failover sharding

2008-08-15 Thread Ian Connor
Hi,

Is there a way to put a timeout or have some way of ignoring shards
that are not there? For instance, I have 4 shards, and they have
overlap with the documents for redundancy.

shard 1 = 0-200
shard 2 = 100-400
shard 3 = 300-600
shard 4 = 500-600 and 0-100

This means if one of my shards goes down, then I can still give
results. If there was some option that said "wait 1 second and then
give up", this would work perfectly for me.
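
For what it's worth, if the timeAllowed support from SOLR-502 does end up
covering unreachable shards (the open question in the SOLR-502 thread
above), the request might look like this sketch -- hosts illustrative,
with a 1000 ms budget:

curl 'http://shard1:8983/solr/select?q=foo&shards=shard1:8983/solr,shard2:8983/solr,shard3:8983/solr,shard4:8983/solr&timeAllowed=1000'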

-- 
Regards,

Ian Connor


Solr Cache

2008-08-15 Thread Tim Christensen
We have two servers with the same index, load balanced. The indexes
are updated at the same time every day. Occasionally, a search on one
server will return different results from the other server, even
though the data used to create the index is exactly the same.


Is this possibly due to caching? Does the cache reset automatically  
after the commit?


The problem usually resolves itself, by all appearances randomly,
but I assume something I don't know about is going on, such as a new
searcher starting up at some point in the day. All cache settings
are the solrconfig defaults.


Thank you ahead of time.


Tim Christensen
Director, Media & Technology
Vann's Inc.
406-203-4656

[EMAIL PROTECTED]

http://www.vanns.com