Re: IndexWriter and memory usage

2010-05-10 Thread Michael McCandless
H...

Your usage (searching for old doc & updating it, to add new fields) is fine.

But: what memory usage do you see if you open a searcher, and search
for all docs, but don't open an IndexWriter?  We need to tease apart
the IndexReader vs IndexWriter memory usage you are seeing.  Also, can
you post the output of CheckIndex (java
org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
index?  That may give some hints about expected memory usage of IR (eg
if # unique terms is large).

More comments below:

On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross ross_wo...@bmc.com wrote:
 Sorry to be so long in getting back on this. The patch you provided has 
 improved the situation but we are still seeing some memory loss.  The 
 following are some images from the heap dump.  I'll share with you what we 
 are seeing now.

 This first image shows the memory pattern.  Our first commit takes place at 
 about 3:54, when the steady trend up takes a drop and the new cycle begins.   
 What we have found is that the 2422 fix has made the memory in the first phase 
 before the commit much better (and I'm sure throughout the entire run).  But 
 as you can see, after the commit we again begin to lose memory.  One piece 
 of info to know about what you are seeing: we have 5 
 threads that are pushing data to our Lucene plugin.  If we drop it down to 1 
 thread then we are much more successful and can actually index all of our 
 data without running out of memory, but at 5 threads it gets into trouble.  We 
 still see a trend up in memory usage, but not as severe as when we use 
 multiple threads.
 http://tinypic.com/view.php?pic=2w6bf68s=5

Can you post the output of svn diff on the 2.9 code base you're
using?  I just want to look & verify all issues we've discussed are
included in your changes.  The fact that 1 thread is fine and 5
threads are not still sounds like a symptom of LUCENE-2283.

Also, does that heap usage graph exclude garbage?  Or, alternatively,
can you provoke an OOME w/ 512 MB heap and then capture the heap dump
at that point?

 There is another piece of the picture that I think might be coming into play. 
  We have plugged Lucene into a legacy app and are subject to how we can get 
 it to deliver the data that we are indexing.  In some scenarios (like the one 
 we are having this problem with) we are building our documents progressively 
 (adding fields to the document through the process).  What you see before the 
 first commit is the legacy system handing us the first field for many 
 documents. Once we have gotten all of field 1 for all documents then we 
 commit that data into the index.  Then the system starts feeding us field 
 2.  So we perform a search to see if the document already exists (for the 
 scenario you are seeing, it does), retrieve the original document 
 (we store a document ID), add the new field of data to the 
 existing document, and update the document in the index.  After the first 
 commit, the rest of the process is one where a document already exists, so 
 the new field is added and the document is updated.  It is in this 
 process that we start rapidly losing memory.  The following images show some 
 examples of common areas where memory is being held.

 http://tinypic.com/view.php?pic=11vkwnbs=5

This looks like normal memory usage of IndexWriter -- these are the
recycled buffers used for holding stored fields.  However: the net RAM
used by this allocation should not exceed your 16 MB IW ram buffer
size -- does it?
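
For reference, a minimal way to sanity-check that at runtime -- just a sketch,
assuming "writer" is your already-opened Lucene 2.9 IndexWriter:

    writer.setRAMBufferSizeMB(16.0);              // the 16 MB budget discussed above
    long bufferedBytes = writer.ramSizeInBytes(); // RAM currently held by the writer's buffers
    System.out.println("IW buffered RAM: "
        + (bufferedBytes / (1024.0 * 1024.0)) + " MB");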

 http://tinypic.com/view.php?pic=abq9fps=5

This one is the byte[] buffer used by CompoundFileReader, opened by
IndexReader.  It's odd that you have so many of these (if I'm reading
this correctly) -- are you certain all opened readers are being
closed?  How many segments do you have in your index?  Or... are there
many unique threads doing the searching?  EG do you create a new
thread for every search or update?
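
If readers are being opened per search, one common pattern -- only a sketch,
assuming a single shared IndexReader field named "reader" and that no searches
are still running against the old instance -- is to reopen and close rather
than opening fresh readers each time:

    IndexReader newReader = reader.reopen();  // returns a new reader only if the index changed
    if (newReader != reader) {
        reader.close();      // frees the old reader's buffers (e.g. CompoundFileReader's)
        reader = newReader;
    }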

 http://tinypic.com/view.php?pic=25pskyps=5

This one is also normal memory used by IndexWriter, but as above, the
net RAM used by this allocation (summed w/ the above one) should not
exceed your 16 MB IW ram buffer size.

 As mentioned, we are subject to how we can have the legacy app feed us the 
 data and so this is why we do it this way.  We treat this system as a real 
 time system, and at any time the legacy system may send us a field that needs 
 to be added to or updated in a document.  So we search for the document and, if 
 found, we either add the field or update it if it already exists in the 
 document.  So I started to wonder if a clue to this memory loss comes from 
 the fact that we are retrieving an existing document and then adding to it 
 and updating.

 Now, if we eliminate the updating and simply add each item as a new document 
 (which we did just to test but won't be adequate for our running system), 
 then we still see a slight trend upward in memory usage and the 
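
For what it's worth, here is a sketch of the search-then-update flow described
above.  The field names ("docId", "field2") and the helper class are
illustrative assumptions, not the poster's actual code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class FieldUpdater {
        public static void addOrUpdateField(IndexSearcher searcher, IndexWriter writer,
                                            String id, String newValue) throws Exception {
            TopDocs hits = searcher.search(new TermQuery(new Term("docId", id)), 1);
            Document doc;
            if (hits.totalHits > 0) {
                doc = searcher.doc(hits.scoreDocs[0].doc);  // returns stored fields only
                doc.removeField("field2");                  // drop the old value, if any
            } else {
                doc = new Document();
                doc.add(new Field("docId", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            }
            doc.add(new Field("field2", newValue, Field.Store.YES, Field.Index.ANALYZED));
            // updateDocument deletes any document containing the term, then adds this one:
            writer.updateDocument(new Term("docId", id), doc);
        }
    }

One caveat on this kind of whole-document update: searcher.doc() only returns
fields that were stored, so any unstored fields of the original document are
lost, and every pass re-indexes (and re-buffers) the entire document.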

The best way to stop indexing quickly?

2010-05-10 Thread alx27 alx27
Hi all,

Imagine a situation: Lucene has started indexing a huge file, and just after
this the user demands that the application be shut down immediately. What would
be the recommended way of doing this, so that the application shuts down within
seconds, but with the least possible damage to the index?

best regards,
Alex
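
One possible approach -- only a sketch, assuming "writer" is the IndexWriter
doing the bulk indexing, and with no guarantee it fits the "within seconds"
requirement if a large RAM buffer still has to be flushed:

    // Option 1: keep what has been indexed so far, but don't wait for merges.
    writer.commit();
    writer.close(false);   // false = do not wait for background merges to finish

    // Option 2: abandon everything since the last commit; the index is left
    // exactly as it was at that commit point.
    // writer.rollback();

Either way the index on disk stays consistent, because Lucene only makes
changes visible at a commit point.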


Re: Trace only exactly matching terms!

2010-05-10 Thread manjula wijewickrema
Hi Anshum & Erick,

As you have mentioned, I used SnowballAnalyzer for stemming purposes. It
worked nicely. Thanks a lot for your guidance.

Manjula.

On Fri, May 7, 2010 at 8:27 PM, Erick Erickson erickerick...@gmail.com wrote:

 The other approach is to use a stemmer both at index and query time.

 BTW, it's very easy to make a custom analyzer by chaining together
 a Tokenizer and as many filters as you need (e.g. PorterStemFilter), essentially
 composing your analyzer from various pre-built Lucene parts.

 HTH
 Erick

 On Fri, May 7, 2010 at 9:07 AM, Anshum ansh...@gmail.com wrote:

  Hi Manjula,
  Yes, Lucene by default would only tackle exact term matches unless you use a
  custom analyzer to expand the index/query.
 
  --
  Anshum Gupta
  http://ai-cafe.blogspot.com
 
  The facts expressed here belong to everybody, the opinions to me. The
  distinction is yours to draw
 
 
  On Fri, May 7, 2010 at 2:22 PM, manjula wijewickrema manjul...@gmail.com wrote:
 
   Hi,
  
   I am using Lucene 2.9.1. I have downloaded and run the 'HelloLucene.java'
   class by modifying the input document and user query in various ways. Once
   I put the document sentences as 'Lucene in actions' instead of 'Lucene in
   action', and I gave the query as 'action' and ran the programme. But it
   did not show me 'Lucene in action' as a hit! What is the reason for this?
   Why doesn't it tackle the word 'actions' as a hit? Does Lucene identify
   only exactly matching words?
  
   Thanks
   Manjula
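
An aside on the chaining Erick describes above -- a minimal sketch of a custom
stemming analyzer for Lucene 2.9 (the class name is made up; SnowballAnalyzer
from contrib/analyzers, which Manjula ended up using, is another option):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Tokenize, lowercase, then stem, so "actions" and "action" index to the
    // same term.  Use the same analyzer at index time and at query time.
    public class StemmingAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_29, reader);
            stream = new LowerCaseFilter(stream);
            return new PorterStemFilter(stream);
        }
    }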
  
 



Class_for_HighFrequencyTerms

2010-05-10 Thread manjula wijewickrema
Hi,

If I index a document (a single document) in Lucene, then how can I get the
term frequencies (even the first and second highest occurring terms) of that
document? Is there any class/method to do that? If anybody knows, please help
me.

Thanks
Manjula


MatchAllDocsQuery and MatchNoDocsQuery

2010-05-10 Thread Mindaugas Žakšauskas
Hi,

Can anybody confirm whether MatchAllDocsQuery can be used as an
immutable singleton? By this I mean creating a single instance and
sharing it whenever I need to either use it on its own or in
conjunction with other queries put into a BooleanQuery, to return all
documents in a search result. Can the same instance even be reused
among different threads?

What would be the best way of implementing MatchNoDocsQuery? My initial
tests show that a new BooleanQuery() without any additional clauses
would just do the job, but I just wanted to double check whether this
is a reliable assumption. The above questions also apply - could this
be reused among different contexts, threads?

Thanks in advance.

Regards,
Mindaugas




Re: MatchAllDocsQuery and MatchNoDocsQuery

2010-05-10 Thread Yonik Seeley
Yes on all counts.  Lucene doesn't modify query objects, so they are
safe for reuse among multiple threads.

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague



2010/5/10 Mindaugas Žakšauskas min...@gmail.com:
 Hi,

 Can anybody confirm whether MatchAllDocsQuery can be used as an
 immutable singleton? By this I mean creating a single instance and
 sharing it whenever I need to either use it on its own or in
 conjunction with other queries put into a BooleanQuery, to return all
 documents in a search result. Can the same instance even be reused
 among different threads?

 What would be the best way of implementing MatchNoDocsQuery? My initial
 tests show that a new BooleanQuery() without any additional clauses
 would just do the job, but I just wanted to double check whether this
 is a reliable assumption. The above questions also apply - could this
 be reused among different contexts, threads?

 Thanks in advance.

 Regards,
 Mindaugas
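
A sketch of the resulting pattern, given Yonik's answer -- the holder class is
made up, and the "match no docs" half simply reuses the empty-BooleanQuery trick
from the original question:

    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;

    public final class SharedQueries {
        // Safe to share across threads; Lucene does not modify Query objects while searching.
        public static final Query MATCH_ALL = new MatchAllDocsQuery();
        // An empty BooleanQuery matches no documents.  Treat the shared instance as
        // read-only: adding clauses to it would change it for every caller.
        public static final Query MATCH_NONE = new BooleanQuery();

        private SharedQueries() {}
    }

The one mutable thing on a Query is its boost, so avoid calling setBoost() on a
shared instance.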







Re: Class_for_HighFrequencyTerms

2010-05-10 Thread Erick Erickson
Have you looked at TermFreqVector?

Best
Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema
manjul...@gmail.com wrote:

 Hi,

 If I index a document (a single document) in Lucene, then how can I get the
 term frequencies (even the first and second highest occurring terms) of that
 document? Is there any class/method to do that? If anybody knows, please help
 me.

 Thanks
 Manjula
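
To make that pointer concrete, a rough sketch -- the index path and the field
name "contents" are assumptions, and the field must have been indexed with term
vectors enabled (e.g. Field.TermVector.YES), otherwise getTermFreqVector()
returns null:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.FSDirectory;

    public class TopTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);  // true = read-only
            TermFreqVector tfv = reader.getTermFreqVector(0, "contents");  // doc 0 = the single doc
            String[] terms = tfv.getTerms();           // unique terms of that field
            int[] freqs = tfv.getTermFrequencies();    // parallel array of frequencies
            for (int i = 0; i < terms.length; i++) {   // scan these to find the top terms
                System.out.println(terms[i] + " : " + freqs[i]);
            }
            reader.close();
        }
    }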



Re: merge results from physically separate hosts

2010-05-10 Thread Shaun Senecal
Sorry for the delayed response...

Thanks, that's what I thought.  In my case, the schema of each index
would be slightly different, so I would want to run a PrefixQuery
against each index (all fields in each index) using the same query
text.  Maybe I would be able to take the results from each index and
then simply sort based on the ScoreDoc or something to get the most
relevant docs.

Is there a technical reason why Solr requires the index schema to be
the same, or was this simply the design that was chosen?


Shaun


On Mon, Apr 26, 2010 at 6:59 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 Solr's distributed search feature is about querying multiple indexes and
 merging the results. Different indexes, but same schema.

        Erik

 On Apr 25, 2010, at 6:02 AM, Shaun Senecal wrote:

 Is there currently a way to take a query, run it on multiple hosts
 containing different indexes, then merge the results from each host to
 present to the user?  It looks like Solr can handle multiple hosts
 supporting the same index, but my case requires each index to be
 different.
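
For the Lucene-level merge Shaun describes, one sketch is MultiSearcher, which
merges hits across several local searchers and normalizes scores using combined
document frequencies (the paths and field name here are assumptions; for
physically separate hosts you would still need something like the contrib
RemoteSearchable, or your own merge layer, in front of it):

    // Merge hits from two local indexes whose schemas may differ.
    Searchable[] shards = new Searchable[] {
        new IndexSearcher(FSDirectory.open(new File("/path/to/indexA")), true),
        new IndexSearcher(FSDirectory.open(new File("/path/to/indexB")), true),
    };
    MultiSearcher searcher = new MultiSearcher(shards);
    TopDocs hits = searcher.search(new PrefixQuery(new Term("title", "luc")), 10);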






