Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
But the Readers I'm talking about are not held by the Tokenizer (at least not *only* by it); these are held by the DocFieldProcessorPerThread: IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsDa

Re: Lucene Partition Size

2010-04-08 Thread Michael McCandless
On Thu, Apr 8, 2010 at 2:44 PM, Karl Wettin wrote: > > 8 apr 2010 kl. 20.05 skrev Ivan Provalov: > >> We are using Lucene for searching of 200+ mln documents (periodical >> publications).  Is there any limitation on the size of the Lucene index >> (file size, number of docs, etc...)? > > The only

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Justin
From an architecture standpoint, wait/notify does require extra logic to catch any notify calls while a searcher is being replaced. Using interrupt() was quite convenient for ensuring the searcher was up-to-date. - Original Message From: Simon Willnauer To: java-user@lucene.apach

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Simon Willnauer
Argh! One more person running into this issue. It still bugs me that NIOFSDirectory struggles so badly if interrupt is used. simon On Thu, Apr 8, 2010 at 11:19 PM, Justin wrote: > We have a custom IndexSearcher that fetches a near real-time reader and calls > FieldCache.DEFAULT.getStrings() after a c

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
There is one possibility, that could be fixed: As Tokenizers are reused, the analyzer holds a reference to the last used Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If this is the case for you, that may be easy to do. So Tokenizer.close() looks like this: /** By d
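The fix Uwe describes can be modeled in plain Java. This is a minimal sketch (the class and method names are illustrative, not Lucene's actual Tokenizer): close() both closes the Reader and nulls the field, so the last document's Reader is no longer reachable from the reused, long-lived Tokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class SketchTokenizer {
    protected Reader input;

    SketchTokenizer(Reader input) { this.input = input; }

    // Reused tokenizers get pointed at the next document's Reader here.
    void reset(Reader input) { this.input = input; }

    // The proposed fix: close AND drop the reference, so the last
    // document's (possibly heavyweight) Reader becomes garbage-collectable.
    void close() throws IOException {
        if (input != null) {
            input.close();
            input = null;
        }
    }

    boolean holdsReader() { return input != null; }
}

public class TokenizerCloseDemo {
    public static void main(String[] args) throws IOException {
        SketchTokenizer t = new SketchTokenizer(new StringReader("some document text"));
        t.close();
        System.out.println(t.holdsReader()); // prints "false"
    }
}
```

Without the `input = null` line, the reference would survive until the next reset(), which is exactly the retention being discussed.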

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Justin
We have a custom IndexSearcher that fetches a near real-time reader and calls FieldCache.DEFAULT.getStrings() after a calculated length of time or when certain changes are made to the index (requiring immediate searchability). The thread slept for that length of time unless an interrupt was giv

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Michael McCandless
OK, phew :) Yea warming in a separate thread is common... but why does Thread.interrupt() come into play in your app for warming? Mike On Thu, Apr 8, 2010 at 4:38 PM, Justin wrote: > In fact, we are using Thread.interrupt() to warm up a searcher in a separate > thread (not really that uncommon

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Justin
In fact, we are using Thread.interrupt() to warm up a searcher in a separate thread (not really that uncommon, is it?). We may switch to Object::wait(long) and Object::notify() instead of switching the Directory implementation. Thanks for recognizing the issue! - Original Message
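The wait/notify alternative Justin mentions can be sketched in plain Java (class and method names like SearcherWarmer and signalChange are hypothetical): indexing threads notify a monitor instead of interrupting the warming thread, which sidesteps the NIOFSDirectory channel closure triggered by Thread.interrupt().

```java
class SearcherWarmer {
    private final Object monitor = new Object();
    private boolean changed = false;

    // Indexing threads call this (instead of Thread.interrupt()) when the
    // index changed and the searcher should be refreshed right away.
    void signalChange() {
        synchronized (monitor) {
            changed = true;
            monitor.notifyAll();
        }
    }

    // The warming thread calls this instead of sleeping. Returns true if
    // woken early by signalChange(), false if the interval elapsed.
    boolean awaitNextWarm(long intervalMs) throws InterruptedException {
        synchronized (monitor) {
            long deadline = System.currentTimeMillis() + intervalMs;
            while (!changed) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) break;
                monitor.wait(remaining);
            }
            boolean woken = changed;
            changed = false;
            return woken;
        }
    }
}

public class WarmLoopDemo {
    public static void main(String[] args) throws InterruptedException {
        SearcherWarmer warmer = new SearcherWarmer();
        warmer.signalChange();
        System.out.println(warmer.awaitNextWarm(1000)); // prints "true"
        System.out.println(warmer.awaitNextWarm(50));   // prints "false"
    }
}
```

The loop around wait() also covers the spurious-wakeup and "notify arrived while we were warming" cases mentioned earlier in the thread.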

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
Yeah, I checked again and IndexWriter is holding references to the Reader, I'm afraid. I opened bug report https://issues.apache.org/jira/browse/LUCENE-2387 to track this down. On Thu, Apr 8, 2010 at 2:50 PM, Ruben Laguna wrote: > I will double check in the afternoon the heapdump.hprof. But I

Re: Lucene Partition Size

2010-04-08 Thread Ivan Provalov
Karl, We have not done the same scale local-disk test. Our network parameters are - Network speed: 1gb - 3 partitions per volume - The volumes are accessed via NFS to EMC Celera devices. (NFS 3) - The drives are 300 gb fiber attached with 10,000 rpm. Thanks, Ivan --- On Thu, 4/8/10, Karl

Re: ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Michael McCandless
Are you using Future.cancel or directly using Thread.interrupt? If so it could be this nasty issue: https://issues.apache.org/jira/browse/LUCENE-2239 Try temporarily using a Directory impl other than NIOFSDirectory and see if the problem still happens? Mike On Thu, Apr 8, 2010 at 2:14 PM,

Re: Lucene Partition Size

2010-04-08 Thread Karl Wettin
8 apr 2010 kl. 20.05 skrev Ivan Provalov: We are using Lucene for searching of 200+ mln documents (periodical publications). Is there any limitation on the size of the Lucene index (file size, number of docs, etc...)? The only such limitation in Lucene I'm aware of is Integer.MAX_VALUE

ClosedChannelException from IndexWriter.getReader()

2010-04-08 Thread Justin
I'm getting a ClosedChannelException from IndexWriter.getReader(). I don't think the writer has been closed and, if it were, I would expect an AlreadyClosedException as described in the API documentation. Does anyone have an idea what might be wrong? The disk is not full and the permissions l

Lucene Partition Size

2010-04-08 Thread Ivan Provalov
We are using Lucene for searching of 200+ mln documents (periodical publications). Is there any limitation on the size of the Lucene index (file size, number of docs, etc...)? We are partitioning the indexes at about 10 mln documents per partition (each partition is on a separate box, some m
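For partitions that live on one box, the 3.0-era MultiSearcher can combine per-partition IndexSearchers; this is a hedged sketch (the directory paths are invented, and cross-machine setups like the one described would typically merge results at the application layer instead). Requires the Lucene core jar on the classpath.

```java
import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;

public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: one index directory per partition.
        Searchable[] partitions = {
            new IndexSearcher(FSDirectory.open(new File("/indexes/part1")), true),
            new IndexSearcher(FSDirectory.open(new File("/indexes/part2")), true),
        };
        // MultiSearcher merges hits (and remaps doc ids) across partitions.
        MultiSearcher searcher = new MultiSearcher(partitions);
        // searcher.search(query, 10) would now span both partitions.
        searcher.close();
    }
}
```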

Re: Similarity based on regexp

2010-04-08 Thread Michael McCandless
You can use RegexQuery (from contrib/regex) for this? (In 3.1 there's a higher-performance, very similar RegexpQuery, too.) Mike On Thu, Apr 8, 2010 at 10:10 AM, Hans-Henning Gabriel wrote: > Hello everybody, > > this is what I would like to do: > I have an index with documents containing a fi
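A minimal sketch of the suggestion against the 3.0-era contrib/regex API; the field name "authors" and the pattern are invented for illustration, and the lucene-regex contrib jar is assumed to be on the classpath.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.regex.RegexQuery;

public class RegexQuerySketch {
    // RegexQuery matches indexed terms against a regular expression and
    // then scores the matching documents like any other multi-term query.
    public static RegexQuery authorsMatching(String pattern) {
        return new RegexQuery(new Term("authors", pattern));
    }
}
```

A call like searcher.search(RegexQuerySketch.authorsMatching("gabr.*"), 10) would then rank documents whose authors field contains a term matching the pattern.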

Similarity based on regexp

2010-04-08 Thread Hans-Henning Gabriel
Hello everybody, this is what I would like to do: I have an index with documents containing a field 'authors'. I would like to find all documents that have authors similar to a given author string. One could do this with a special query, relying on Lucene's scoring/ranking mechanism. But I would l

Highlighting search results from files

2010-04-08 Thread Saju Pillai
Hello, I am new to Lucene. I am trying to highlight results for files on disk. The file content is indexed as : Reader freader = new FileReader(filepath); doc.add(new Field("content", freader)); In the Highlighter.getBestFragments(tokenStream, text, .) api: 1) is tokenStream == analy
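One common wiring for the 2.9/3.0 contrib/highlighter API is sketched below; 'analyzer' and 'query' stand in for the application's own objects. Note that Field(String, Reader) does not store the content, so the raw text must be re-read from the file and passed both to the analyzer and as the text argument, which keeps the token offsets aligned.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightSketch {
    // Hedged sketch: fileText is the file's content re-read from disk,
    // since a Reader-valued field is indexed but never stored.
    public static String[] bestFragments(Analyzer analyzer, Query query,
                                         String fileText) throws Exception {
        // Re-analyze exactly the text we pass as the 'text' argument,
        // so the highlighter's offsets line up with it.
        TokenStream tokenStream =
            analyzer.tokenStream("content", new StringReader(fileText));
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        return highlighter.getBestFragments(tokenStream, fileText, 3);
    }
}
```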

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
I will double-check the heapdump.hprof in the afternoon. But I think that *some* readers are indeed held by docWriter.threadStates[0].consumer.fieldHash[1].fields[], as shown in [1] (this heapdump contains only live objects). The heapdump was taken after IndexWriter.commit() /IndexWriter.optim

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
Readers are not held. If you indexed the document and GC'ed the document instance, the readers are gone. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Ruben Laguna [mailto:ruben.lag...@gmail.com] > Sen

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
And by the way, when is Lucene 3.1 coming? On Thu, Apr 8, 2010 at 1:27 PM, Ruben Laguna wrote: > Now that the zzBuffer issue is solved... > > what about the references to the Readers held by docWriter. Tika's > ParsingReaders are quite heavyweight, so retaining those in memory > unnecessarily is a

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
Now that the zzBuffer issue is solved... what about the references to the Readers held by docWriter? Tika's ParsingReaders are quite heavyweight, so retaining those in memory unnecessarily is also a "hidden" memory leak. Should I open a bug report on that one? /Rubén On Thu, Apr 8, 2010 at 12:11 P

Berlin Buzzwords - early registration extended

2010-04-08 Thread Isabel Drost
Hello, we would like to invite everyone interested in data storage, analysis and search to join us for two days on June 7/8th in Berlin for an in-depth, technical, developer-focused conference located in the heart of Europe. Presentations will range from beginner friendly introductions on the

Re: IndexWriter memory leak?

2010-04-08 Thread Shai Erera
Guess we were replying at the same time :). On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler wrote: > I already answered, that I will take care of this! > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
I responded because the mentioned issue will change the whole class structure in the standard package, so any patch would soon get outdated. So it's best to add it directly there. But if you try it out and it works, that's fine. The fix would be in 3.1, so if you need to fix your 3.0.1 versio

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
That was fast! I was already writing a patch... just to see if it works. On Thu, Apr 8, 2010 at 12:02 PM, Uwe Schindler wrote: > Hi Shai, hi Ruben, > > I will take care of this in > https://issues.apache.org/jira/browse/LUCENE-2074 where some parts of the > Tokenizer impl are rewritten. > > ---

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
I already answered, that I will take care of this! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Shai Erera [mailto:ser...@gmail.com] > Sent: Thursday, April 08, 2010 12:00 PM > To: java-user@luce

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
Hi Shai, hi Ruben, I will take care of this in https://issues.apache.org/jira/browse/LUCENE-2074 where some parts of the Tokenizer impl are rewritten. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Ru

Re: IndexWriter memory leak?

2010-04-08 Thread Shai Erera
Yes, that's the trimBuffer version I was thinking about, only this guy created a reset(Reader, int) and does both ops (resetting + trim) in one method call. More convenient. Can you please open an issue to track that? People will have a chance to comment on whether we (Lucene) should handle that, o

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
I was investigating this a little further and in the JFlex mailing list I found [1]. I don't know much about flex / JFlex, but it seems that this guy resets the zzBuffer to 16384 or less when setting the input for the lexer. Quoted from shef: I set %buffer 0 in the options section, and then ad
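The idea from the JFlex thread can be modeled in plain Java. This is an illustrative sketch, not generated JFlex code: on reset, the scanner keeps its char buffer unless a previous oversized token has grown it past a cap, in which case the buffer is reallocated at the cap.

```java
import java.io.Reader;
import java.io.StringReader;

class TrimmingScanner {
    static final int DEFAULT_BUFFER_SIZE = 16384;
    private char[] zzBuffer = new char[DEFAULT_BUFFER_SIZE];
    private Reader zzReader;

    // Models the proposed reset(Reader, int): take the next input AND trim
    // the buffer if a previous huge token has grown it beyond maxSize.
    void reset(Reader reader, int maxSize) {
        this.zzReader = reader;
        if (zzBuffer.length > maxSize) {
            zzBuffer = new char[maxSize];
        }
    }

    // Models JFlex's buffer growth while scanning one very long token.
    void ensureCapacity(int needed) {
        if (needed > zzBuffer.length) {
            zzBuffer = new char[needed];
        }
    }

    int bufferLength() { return zzBuffer.length; }
}

public class TrimDemo {
    public static void main(String[] args) {
        TrimmingScanner scanner = new TrimmingScanner();
        scanner.ensureCapacity(1_000_000);           // one pathological document
        System.out.println(scanner.bufferLength());  // prints "1000000"
        scanner.reset(new StringReader("next doc"), TrimmingScanner.DEFAULT_BUFFER_SIZE);
        System.out.println(scanner.bufferLength());  // prints "16384"
    }
}
```

Without the trim in reset(), one pathological document would pin the megabyte-sized buffer for the life of the reused tokenizer, which is the retained-memory symptom in this thread.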

Re: custom low-level indexer (to speed things up) when fields, terms and docids are in order

2010-04-08 Thread Michael McCandless
Very interesting! Newer versions of Lucene have cut over to a dedicated utility class (oal.util.StringHelper) for faster interning with threads. I wonder if that'd help your case; which Lucene version are you using? Thanks for bringing closure, Mike On Wed, Apr 7, 2010 at 3:09 PM, britske wrote

Re: IndexWriter memory leak?

2010-04-08 Thread Shai Erera
If we could change the Flex file so that yyreset(Reader) would check the size of zzBuffer, we could trim it when it gets too big. But I don't think we have such control when writing the flex syntax ... yyreset is generated by JFlex and that's the only place I can think of to trim the buffer down wh

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
> I would like to identify also the problematic document I have 1 so, what would be the best way of identifying the one that is making zzBuffer grow without control? Don't index your documents, but instead pass them directly to the analyzer and consume the tokenstream manually. Then
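Consuming the token stream outside IndexWriter, as suggested above, might look like this against the 3.0 API (the field name is a placeholder, and the surrounding loop over documents is the application's own). Watching the heap, or instrumenting the tokenizer, while this runs shows which document blows up the buffer without involving IndexWriter at all.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class TokenStreamDrain {
    // Hedged sketch: run one document's Reader through the analyzer
    // directly and exhaust the stream, exactly as indexing would.
    public static int drain(Analyzer analyzer, String field, Reader reader)
            throws Exception {
        TokenStream stream = analyzer.tokenStream(field, reader);
        int tokens = 0;
        while (stream.incrementToken()) {
            tokens++;   // token attributes could be inspected here
        }
        stream.close();
        return tokens;
    }
}
```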

RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
Hi Ruben, as Shai already pointed out, the buffer with this large size is held by "StandardTokenizer", which is used in the "StandardAnalyzer". This code is out of Lucene's control, as it is generated by the JFlex library. As long as the IndexWriter instance is living, the buffer is held implic

Re: IndexWriter memory leak?

2010-04-08 Thread Ruben Laguna
I'm using StandardAnalyzer. I indeed parse large documents, XML and PDFs, using NekoHTML and Tika respectively. I took a look at the zzBuffer contents (by exporting it to a file with Eclipse MAT from the heapdump) and it seems to contain normal text from several documents. See below cat he