But the Readers I'm talking about are not held by the Tokenizer (at least
not *only* by it); they are held by the DocFieldProcessorPerThread:
IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
Field (fieldsData)
On Thu, Apr 8, 2010 at 2:44 PM, Karl Wettin wrote:
>
> On 8 Apr 2010, at 20.05, Ivan Provalov wrote:
>
>> We are using Lucene for searching of 200+ mln documents (periodical
>> publications). Is there any limitation on the size of the Lucene index
>> (file size, number of docs, etc...)?
>
> The only such limitation in Lucene I'm aware of is Integer.MAX_VALUE
> documents per index.
From an architecture standpoint, wait/notify does require extra logic to catch
any notify calls while a searcher is being replaced. Using interrupt() was
quite convenient for ensuring the searcher was up-to-date.
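One way to structure that extra logic, sketched in plain Java (the class and
field names below are made up, just to illustrate the pattern):

class ReopenSignal {
  private final Object lock = new Object();
  private boolean reopenRequested = false;

  // Indexing code calls this when changes must become searchable immediately.
  void requestReopen() {
    synchronized (lock) {
      reopenRequested = true;
      lock.notify();
    }
  }

  // Warming thread: sleeps up to maxMillis, or returns early if a reopen was
  // requested. Checking the flag inside the lock means a notify() sent while
  // the searcher is being replaced is not lost.
  void awaitReopen(long maxMillis) throws InterruptedException {
    synchronized (lock) {
      if (!reopenRequested) {
        lock.wait(maxMillis);
      }
      reopenRequested = false;
    }
  }
}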
- Original Message
From: Simon Willnauer
To: java-user@lucene.apache.org
Argh! One more person running into this issue.
It still bugs me that NIOFSDirectory struggles so badly if interrupt is used.
simon
On Thu, Apr 8, 2010 at 11:19 PM, Justin wrote:
> We have a custom IndexSearcher that fetches a near real-time reader and calls
> FieldCache.DEFAULT.getStrings() after a calculated length of time or when
> certain changes are made to the index.
There is one possibility that could be fixed:
As Tokenizers are reused, the analyzer holds a reference to the last used
Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If this
is the case for you, that may be easy to do. Tokenizer.close() would then look
like this (the added line unsets the Reader so it can be garbage collected):

/** By default, closes the input Reader. */
@Override
public void close() throws IOException {
  input.close();
  input = null;  // unset the Reader so it no longer pins the last document's input
}
We have a custom IndexSearcher that fetches a near real-time reader and calls
FieldCache.DEFAULT.getStrings() after a calculated length of time or when
certain changes are made to the index (requiring immediate searchability). The
thread slept for that length of time unless an interrupt was given.
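Roughly, the warming step itself looks like this (a simplified sketch against
the 2.9/3.0 API; the field name is a placeholder, not our real schema):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;

class Warmer {
  // Fetch a near real-time reader and populate the FieldCache for each
  // segment before swapping the new searcher in (closing the old searcher
  // is omitted here for brevity).
  static IndexSearcher warmNewSearcher(IndexWriter writer) throws IOException {
    IndexReader reader = writer.getReader();
    for (IndexReader segment : reader.getSequentialSubReaders()) {
      FieldCache.DEFAULT.getStrings(segment, "myField");
    }
    return new IndexSearcher(reader);
  }
}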
OK, phew :)
Yeah, warming in a separate thread is common... but why does
Thread.interrupt() come into play in your app for warming?
Mike
On Thu, Apr 8, 2010 at 4:38 PM, Justin wrote:
> In fact, we are using Thread.interrupt() to warm up a searcher in a separate
> thread (not really that uncommon, is it?).
In fact, we are using Thread.interrupt() to warm up a searcher in a separate
thread (not really that uncommon, is it?). We may switch to Object::wait(long)
and Object::notify() instead of switching the Directory implementation. Thanks
for recognizing the issue!
- Original Message
Yeah, I checked again and IndexWriter is holding references to the Reader,
I'm afraid.
I opened bug report https://issues.apache.org/jira/browse/LUCENE-2387 to
track this down.
On Thu, Apr 8, 2010 at 2:50 PM, Ruben Laguna wrote:
> I will double-check the heapdump.hprof in the afternoon. But I think that
> *some* readers are indeed held by
> docWriter.threadStates[0].consumer.fieldHash[1].fields[].
Karl,
We have not done the same-scale local-disk test. Our network parameters are:
- Network speed: 1 Gbit
- 3 partitions per volume
- The volumes are accessed via NFS (NFSv3) on EMC Celerra devices.
- The drives are 300 GB Fibre Channel, 10,000 RPM.
Thanks,
Ivan
--- On Thu, 4/8/10, Karl Wettin wrote:
Are you using Future.cancel or directly using Thread.interrupt? If so
it could be this nasty issue:
https://issues.apache.org/jira/browse/LUCENE-2239
Try temporarily using a Directory impl other than NIOFSDirectory and
see if the problem still happens?
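For example, something like this (2.9/3.0 API; the index path is just a
placeholder):

import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

// SimpleFSDirectory reads through RandomAccessFile rather than an NIO
// FileChannel, so Thread.interrupt() does not close the underlying file.
Directory dir = new SimpleFSDirectory(new File("/path/to/index"));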
Mike
On Thu, Apr 8, 2010 at 2:14 PM,
On 8 Apr 2010, at 20.05, Ivan Provalov wrote:
We are using Lucene for searching of 200+ mln documents (periodical
publications). Is there any limitation on the size of the Lucene
index (file size, number of docs, etc...)?
The only such limitation in Lucene I'm aware of is Integer.MAX_VALUE
documents per index, since document numbers are Java ints.
I'm getting a ClosedChannelException from IndexWriter.getReader(). I don't
think the writer has been closed and, if it were, I would expect an
AlreadyClosedException as described in the API documentation. Does anyone have
an idea what might be wrong? The disk is not full and the permissions look correct.
We are using Lucene for searching of 200+ mln documents (periodical
publications). Is there any limitation on the size of the Lucene index (file
size, number of docs, etc...)?
We are partitioning the indexes at about 10 mln documents per partition (each
partition is on a separate box, some m
You can use RegexQuery (from contrib/regex) for this?
(In 3.1 there's a higher-performance, very similar RegexpQuery, too.)
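For example (contrib/regex; the field name and pattern below are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.regex.RegexQuery;

// Matches documents whose 'authors' field has a term matching the regex.
Query q = new RegexQuery(new Term("authors", "sm.th"));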
Mike
On Thu, Apr 8, 2010 at 10:10 AM, Hans-Henning Gabriel
wrote:
> Hello everybody,
>
> this is what I would like to do:
> I have an index with documents containing a field 'authors'.
Hello everybody,
this is what I would like to do:
I have an index with documents containing a field 'authors'. I would like to
find all documents that have authors similar to a given author-string. One
could do this by a special query, relying on Lucene's scoring/ranking mechanism.
But I would l
Hello,
I am new to Lucene. I am trying to highlight results for files on
disk. The file content is indexed as:
Reader freader = new FileReader(filepath);
doc.add(new Field("content", freader));
In the Highlighter.getBestFragments(tokenStream, text, ...) API:
1) is tokenStream == analyzer.tokenStream(...)?
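For reference, one way this is often wired up (a sketch against the contrib
highlighter API; the field name "content" and re-reading the file from disk
are assumptions based on the indexing code above):

import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;

class HighlightSketch {
  // 'text' must be the raw file contents: a Field built from a Reader is not
  // stored in the index, so the original text has to be re-read from disk.
  static String[] bestFragments(Query query, Analyzer analyzer,
                                String filepath, String text)
      throws IOException, InvalidTokenOffsetsException {
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream ts = analyzer.tokenStream("content", new FileReader(filepath));
    return highlighter.getBestFragments(ts, text, 3);
  }
}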
I will double-check the heapdump.hprof in the afternoon. But I think that
*some* readers are indeed held by
docWriter.threadStates[0].consumer.fieldHash[1].fields[],
as shown in [1] (this heapdump contains only live objects). The heapdump
was taken after IndexWriter.commit() / IndexWriter.optimize().
Readers are not held. If you indexed the document and GC'ed the document
instance, the readers are gone.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
> Sen
And by the way, when is Lucene 3.1 coming?
On Thu, Apr 8, 2010 at 1:27 PM, Ruben Laguna wrote:
> Now that the zzBuffer issue is solved...
>
> what about the references to the Readers held by docWriter? Tika's
> ParsingReaders are quite heavyweight, so retaining those in memory
> unnecessarily is also a "hidden" memory leak.
Now that the zzBuffer issue is solved...
what about the references to the Readers held by docWriter? Tika's
ParsingReaders are quite heavyweight, so retaining those in memory
unnecessarily is also a "hidden" memory leak. Should I open a bug report on
that one?
/Rubén
On Thu, Apr 8, 2010 at 12:11 P
Hello,
we would like to invite everyone interested in data storage, analysis and
search
to join us for two days on June 7/8th in Berlin for an in-depth, technical,
developer-focused conference located in the heart of Europe. Presentations will
range from beginner friendly introductions on the
Guess we were replying at the same time :).
On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler wrote:
> I already answered that I will take care of this!
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original
I responded because the mentioned issue will change the whole class structure
in the standard package, so any patch would soon get outdated. So it's best
to add it directly there.
But if you try it out and it works, that's fine. The fix would be in 3.1, so
if you need to fix your 3.0.1 version, you would have to apply the change
yourself.
That was fast! I was already writing a patch... just to see if it works.
On Thu, Apr 8, 2010 at 12:02 PM, Uwe Schindler wrote:
> Hi Shai, hi Ruben,
>
> I will take care of this in
> https://issues.apache.org/jira/browse/LUCENE-2074 where some parts of the
> Tokenizer impl are rewritten.
>
> ---
I already answered that I will take care of this!
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Shai Erera [mailto:ser...@gmail.com]
> Sent: Thursday, April 08, 2010 12:00 PM
> To: java-user@lucene.apache.org
Hi Shai, hi Ruben,
I will take care of this in https://issues.apache.org/jira/browse/LUCENE-2074
where some parts of the Tokenizer impl are rewritten.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Ruben Laguna
Yes, that's the trimBuffer version I was thinking about, only this guy
created a reset(Reader, int) and does both ops (resetting + trim) in one
method call. More convenient. Can you please open an issue to track that?
People will have a chance to comment on whether we (Lucene) should handle
that, or not.
I was investigating this a little further, and in the JFlex mailing list I
found [1].
I don't know much about flex/JFlex, but it seems that this guy resets the
zzBuffer to 16384 or less when setting the input for the lexer.
Quoted from shef:
I set
%buffer 0
in the options section, and then added a reset(Reader, int) method.
Very interesting!
Newer versions of Lucene have cut over to a dedicated utility class
(oal.util.StringHelper) for faster interning with threads. I wonder if
that'd help your case; which Lucene version are you using?
Thanks for bringing closure,
Mike
On Wed, Apr 7, 2010 at 3:09 PM, britske wrote
If we could change the Flex file so that yyreset(Reader) would check the
size of zzBuffer, we could trim it when it gets too big. But I don't think
we have such control when writing the flex syntax ... yyreset is generated
by JFlex, and that's the only place I can think of to trim the buffer down
when the tokenizer is reset.
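Conceptually the trim would be something like this (a hypothetical
two-argument yyreset; zzBuffer and yyreset(Reader) are the JFlex-generated
members, and maxBufferSize is a caller-chosen cap, e.g. JFlex's default
initial size of 16384):

// Hypothetical addition to the generated scanner (e.g. StandardTokenizerImpl):
// reset the input as usual, then shrink the working buffer if a previous
// oversized document has grown it.
final void yyreset(java.io.Reader reader, int maxBufferSize) {
  yyreset(reader);                        // the normal JFlex-generated reset
  if (zzBuffer.length > maxBufferSize) {
    zzBuffer = new char[maxBufferSize];   // drop the huge buffer so it can be GC'd
  }
}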
> I would like to also identify the problematic document(s) I have, so what
> would be the best way of identifying the one that is making zzBuffer grow
> without control?
Don't index your documents; instead, pass them directly to the analyzer and
consume the TokenStream manually. Then you can see which document blows up
the buffer.
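Something along these lines (a sketch using the 2.9/3.0 attribute API; the
field name and using the longest token as the indicator are just assumptions):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

class TokenStreamProbe {
  // Run one document's content through the analyzer and report its longest
  // token; a document producing an extremely long "token" (e.g. a huge
  // unbroken run of text) is a likely culprit for the zzBuffer growth.
  static int longestToken(Analyzer analyzer, String field, Reader content)
      throws IOException {
    TokenStream ts = analyzer.tokenStream(field, content);
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    int max = 0;
    while (ts.incrementToken()) {
      max = Math.max(max, term.termLength());
    }
    ts.end();
    ts.close();
    return max;
  }
}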
Hi Ruben,
as Shai already pointed out, the buffer with this large size is held by
"StandardTokenizer", which is used in the "StandardAnalyzer". This code is out
of Lucene's control, as it is generated by the JFlex library.
As long as the IndexWriter instance is living, the buffer is held implicitly
(via the reused Tokenizer).
I'm using StandardAnalyzer.
I indeed parse large documents, XML and PDFs, using NekoHTML and Tika
respectively.
I took a look at the zzBuffer contents (by exporting them to a file with
Eclipse MAT from the heap dump), and they seem to contain normal text from
several documents. See below:
cat he