Hi Priyanka,
How can I add a machine learning part to Apache Lucene?
I think your question is too broad to answer, because machine learning
covers a lot of things...
Lucene already has a text categorization function, which is a well-known
NLP task, and NLP is a part of machine learning.
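For what it's worth, a rough sketch of that categorization support (the lucene-classification module, available since 4.2); the field names "body" and "category" are made up for the example, and the exact API may differ slightly between 4.x releases:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new java.io.File("/path/to/index")));
AtomicReader leaf = SlowCompositeReaderWrapper.wrap(reader);

SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
// train on already-indexed documents: "body" holds the text, "category" the label
classifier.train(leaf, "body", "category", new StandardAnalyzer(Version.LUCENE_48));

// classify an unseen piece of text
ClassificationResult<BytesRef> result = classifier.assignClass("text of a new, unlabeled document");
System.out.println(result.getAssignedClass().utf8ToString() + " (score " + result.getScore() + ")");
reader.close();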
Hi,
Now, if I don't
close the old index reader, I notice virtual memory increasing with
every new reindex/reopen (this should not be an issue on 64-bit Linux,
correct? That is the configuration I am using, and the indexes are on a
shared-mount NTFS file system).
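A minimal sketch of the reopen pattern, assuming a DirectoryReader (openIfChanged returns null when nothing changed; closing the old reader is what releases its files and mapped memory):

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;

DirectoryReader refresh(DirectoryReader oldReader) throws IOException {
    DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader);
    if (newReader == null) {
        return oldReader;   // index unchanged, keep the current reader
    }
    oldReader.close();      // without this, the old segments stay referenced
    return newReader;
}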
This always brings
Hi Teko,
Sure - I use Lucene through Elasticsearch, but I suppose that doesn't make a
difference in this situation. I needed something like what you are trying
to accomplish - basically to search for any substring... wildcard queries
worked but were kind of slow.
This is my analyzer that works for
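A rough sketch of that kind of substring-friendly analyzer (illustrative only, not the exact analyzer from this message; index-time n-grams via NGramTokenFilter, with 2 and 10 as example gram sizes):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

Analyzer ngramAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_48, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_48, source);
        // emit all 2- to 10-character grams so a plain TermQuery can match any substring
        result = new NGramTokenFilter(Version.LUCENE_48, result, 2, 10);
        return new TokenStreamComponents(source, result);
    }
};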
Hmm, try calling maybeMerge after each .addIndexes?
Robert opened this issue to fix addIndexes:
https://issues.apache.org/jira/browse/LUCENE-5672
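Not from the original message, just a sketch of what that workaround could look like with the 4.x API:

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

void addAndMerge(IndexWriter writer, Directory... sourceDirs) throws IOException {
    writer.addIndexes(sourceDirs); // copies the incoming segments in as-is, no merging
    writer.maybeMerge();           // nudge the merge policy so the segment count comes back down
}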
Mike McCandless
http://blog.mikemccandless.com
On Wed, May 14, 2014 at 11:46 AM, danielv dani...@exlibris.co.il wrote:
Hi,
We have about 550M
reader.document(i) and searcher.doc(i) do the same thing: retrieve the
stored fields.
But neither method fully preserves indexing information; e.g., boosts
are lost, details about how the field was indexed (e.g., DOCS_ONLY,
etc.) are lost, and so on.
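A short illustration (the field name "title" is made up; both calls return only the stored fields):

import org.apache.lucene.document.Document;

Document viaReader   = reader.document(docID);   // IndexReader
Document viaSearcher = searcher.doc(docID);      // IndexSearcher, delegates to the reader
String title = viaSearcher.get("title");         // only works if "title" was stored
// nothing here tells you how "title" was indexed (boost, DOCS_ONLY, etc.)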
You can use the returned document to provide the
True for the first two use cases, but as I indicated, the third use case is
problematic since the token needs to be split. The n-gram solution does seem
to cover it, though, sort of.
The n-gram solution doesn't cover "good morning, john" or "good morning -
john", but that could be handled by
delGen=-1 means there are no deletions, but the exception makes no
sense, because up above SegmentReader.java calls si.hasDeletions(),
which returns delGen != -1, which should have meant
Lucene40LiveDocsFormat.readLiveDocs should not have been called. It
seems impossible :)
What Java version?
Mike
On May 7, 2014, at 6:46 AM, Cheng zhoucheng2...@gmail.com wrote:
I have an index of multiple gigabytes which serves 5-10 threads and needs
refreshing very often. I wonder if RAMDirectory is a good candidate for
this purpose. If not, what kind of directory is better?
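Not an answer from the thread, just a sketch of one common alternative: keep the index on disk with MMapDirectory (the OS page cache does the caching) and share refreshed searchers across the 5-10 threads with SearcherManager:

import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.MMapDirectory;

MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
SearcherManager manager = new SearcherManager(dir, null);  // null = no custom SearcherFactory

// per query, from any of the searching threads:
IndexSearcher searcher = manager.acquire();
try {
    // searcher.search(...)
} finally {
    manager.release(searcher);
}

// call periodically (or after commits) to pick up index changes:
manager.maybeRefresh();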
We found that loading
Michael
How do you update a document that resides in the index without having
the original document?
Jamie
On 2014/05/13, 3:30 PM, Michael McCandless wrote:
How did you produce the document that you are sending to
updateDocument? Are you loading it from IndexReader.document() or
addIndexes doesn't call maybeMerge, so I think you are just getting into
a situation with too many segments, so applying deletes is slow.
Can you try calling IndexWriter.maybeMerge() after you call
addIndexes? (It won't have immediate impact; you have to do some merges
to get your index healthy.)
Is this true as well for 2.9.2?
-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, May 14, 2014 8:54 AM
To: Lucene Users
Subject: Re: best choice for ramBufferSizeMB
Generally larger is better, as long as the JVM's heap is big enough to allow IW
On Tue, May 13, 2014 at 1:34 AM, Sven Teichmann s.teichm...@s4ip.de wrote:
Hi,
I also found this response very useful and right now I am playing around
with DocValues.
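A minimal sketch of what DocValues usage looks like in code ("price" is an example field; writer and atomicReader are assumed to already exist):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.NumericDocValues;

// index time: store a column-stride per-document value
Document doc = new Document();
doc.add(new NumericDocValuesField("price", 42L));
writer.addDocument(doc);

// search time, per segment (AtomicReader): read it back without touching stored fields
NumericDocValues prices = atomicReader.getNumericDocValues("price");
long price = prices.get(docID);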
If the default DocValuesFormat isn't fast enough, you can always
switch to e.g. DirectDocValuesFormat (uses lots of RAM but
Hello everyone,
We have had a performance issue ever since we stopped optimizing the index. We are
using Lucene 4.8 (32-bit JVM for searching, 64-bit for indexing) on Windows
2008 R2.
Now we are letting Lucene handle the merges using the default merge policy
(TieredMergePolicy).
We have narrowed
Thanks for your answer.
At the moment we use a single thread for indexing. Working with several
threads is a possibility we should try. Testing with different values for
ramBufferSizeMB between 16 MB and 256 MB showed that above 128 MB there was no
further improvement, as you already mentioned.
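For reference, a sketch of the two knobs being discussed (a 128 MB buffer plus several threads sharing one IndexWriter, which is thread-safe); the path and analyzer are placeholders:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
iwc.setRAMBufferSizeMB(128);   // beyond this the tests above showed no further gain
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), iwc);

// then have N worker threads each call writer.addDocument(...) concurrently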
Emanuel Buzek,
Well, I tried using ShingleFilter first, and I thought it
worked well, but in the end it still did not work the way I want...
So I tried using NGram... I created a new analyzer to use it and ran a
test... Well, it works, but I still need to do some manual validation to
On Wed, 2014-05-07 at 15:46 +0200, Cheng wrote:
I have an index of multiple gigabytes which serves 5-10 threads and needs
refreshing very often. I wonder if RAMDirectory is a good candidate for
this purpose. If not, what kind of directory is better?
RAMDirectory will probably give you poor
Wow, man!!
Forget what I said before!! I did try using your method... well, generating
the index really is still a bit slower (1/2 minutes more),
but querying... man, it works very well, and fast, very fast!!
Really, it is so fast here that what generates the bottleneck is the write
You can retrieve the raw content for each field (assuming you stored it).
But then you must re-generate a Document from the raw content
yourself, as you did originally.
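A sketch of that approach ("id" and "body" and their FieldTypes are examples - Lucene does not remember them for you):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;

Document stored = searcher.doc(docID);          // only the stored values come back
Document rebuilt = new Document();
rebuilt.add(new StringField("id", stored.get("id"), Field.Store.YES));
rebuilt.add(new TextField("body", newBodyText, Field.Store.YES));
writer.updateDocument(new Term("id", stored.get("id")), rebuilt);  // delete-then-add by unique id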
I.e., you cannot rely on Lucene to remember schema-like things like
boost, the FieldType (how the postings were indexed, whether
I've actually been wondering about this as well. More specifically, I've been
wondering if there's any kind of framework for integrating some sort of
learning-to-rank approach (http://en.wikipedia.org/wiki/Learning_to_rank) into Lucene/Solr.
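As far as I know there is no ready-made module for this in Lucene itself; a common do-it-yourself pattern is to re-rank the top N hits with an external model (scoreWithModel below is a placeholder for whatever ranker you trained):

import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

TopDocs top = searcher.search(query, 100);       // cheap first pass with normal Lucene scoring
ScoreDoc[] hits = top.scoreDocs.clone();
for (ScoreDoc hit : hits) {
    hit.score = scoreWithModel(searcher.doc(hit.doc), hit.score);  // hypothetical learned model
}
Arrays.sort(hits, new Comparator<ScoreDoc>() {
    public int compare(ScoreDoc a, ScoreDoc b) { return Float.compare(b.score, a.score); }
});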
Although a similar result can be accomplished by using