Re: How to add machine learning to Apache lucene

2014-05-16 Thread Koji Sekiguchi
Hi Priyanka, "How can I add Machine Learning Part in Apache Lucene." I think your question is too broad to answer because machine learning covers a lot of things... Lucene already has a text categorization function, which is a well-known NLP task, and NLP is a part of machine learning.

RE: Issue with Lucene 3.6.1 and MMapDirectory

2014-05-16 Thread Uwe Schindler
Hi, Now if I don't close the old index reader I am noticing increases of virtual memory with every new reindex reopen (this should not be an issue on 64-bit Linux, correct? This is the configuration I am using, and the indexes are on a shared-mount NTFS file system). This always brings

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread Emanuel Buzek
Hi Teko, sure - I use Lucene through Elasticsearch, but I suppose that doesn't make a difference in this situation. I needed something like what you were trying to accomplish - basically to search any substring... wildcarded queries worked but were kind of slow. This is my analyzer that works for

Re: Merger performance degradation on 3.6.1

2014-05-16 Thread Michael McCandless
Hmm, try calling maybeMerge after each .addIndexes? Robert opened this issue to fix addIndexes: https://issues.apache.org/jira/browse/LUCENE-5672 Mike McCandless http://blog.mikemccandless.com On Wed, May 14, 2014 at 11:46 AM, danielv dani...@exlibris.co.il wrote: Hi, We have about 550M

Re: writer.updateDocument() not working (possible bug?)

2014-05-16 Thread Michael McCandless
reader.document(i) and searcher.doc(i) do the same thing: retrieve the stored fields. But neither method fully preserves indexing information; e.g., boosts are lost, details about how the field was indexed (e.g., DOCS_ONLY, etc.) are lost, etc. You can use the returned document to provide the

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread Jack Krupansky
True, for the first two use cases, but as I indicated, the third use case is problematic since the token needs to be split. The n-gram solution does seem to cover it though, sort of. The n-gram solution doesn't cover "good morning, john" or "good morning - john", but that could be handled by
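
The point about punctuation can be illustrated with a minimal sketch. It assumes character trigrams and a simple "every query n-gram must be present" rule, which may not be what the posters' analyzers actually do; `char_ngrams` and `ngram_match` are hypothetical helper names, not Lucene APIs.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text (assumed scheme)."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_match(indexed_text, query, n=3):
    """True if every n-gram of the query also occurs in the indexed text.

    This is a necessary condition for a substring match, not a sufficient
    one, because set membership ignores n-gram positions.
    """
    grams = char_ngrams(query, n)
    return bool(grams) and grams <= char_ngrams(indexed_text, n)


doc = "good morning john"
print(ngram_match(doc, "morning jo"))          # True: plain substring
print(ngram_match(doc, "good morning, john"))  # False: "ng," is not in the doc
```

Under these assumptions, any punctuation difference between the query and the indexed text produces n-grams that never match, so the text would need to be normalized the same way on both sides.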

Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged

2014-05-16 Thread Michael McCandless
delGen=-1 means there are no deletions, but the exception makes no sense because up above SegmentReader.java calls si.hasDeletions(), which returns delGen != -1, which should have meant Lucene40LiveDocsFormat.readLiveDocs should not have been called. It seems impossible :) What Java version? Mike

Re: Can RAMDirectory work for gigabyte data which needs refreshing of the index all the time?

2014-05-16 Thread Steven Schlansker
On May 7, 2014, at 6:46 AM, Cheng zhoucheng2...@gmail.com wrote: I have an index of multiple gigabytes which serves 5-10 threads and needs refreshing very often. I wonder if RAMDirectory is a good candidate for this purpose. If not, what kind of directory is better? We found that loading

Re: writer.updateDocument() not working (possible bug?)

2014-05-16 Thread Jamie
Michael, how do you update a document that resides in the index without having the original document? Jamie On 2014/05/13, 3:30 PM, Michael McCandless wrote: How did you produce the document that you are sending to updateDocument? Are you loading it from IndexReader.document() or

Re: Merger performance degradation on 3.6.1

2014-05-16 Thread Robert Muir
addIndexes doesn't call maybeMerge, so I think you are just getting into a situation with too many segments, so applying deletes is slow. Can you try calling IndexWriter.maybeMerge() after you call addIndexes? (It won't have immediate impact, you have to do some merges to get your index healthy

RE: best choice for ramBufferSizeMB

2014-05-16 Thread Baldwin, David
Is this true as well for 2.9.2? -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 14, 2014 8:54 AM To: Lucene Users Subject: Re: best choice for ramBufferSizeMB Generally larger is better, as long as JVM's heap is big enough to allow IW

Re: Best practice to map Lucene docids to real ids

2014-05-16 Thread Michael McCandless
On Tue, May 13, 2014 at 1:34 AM, Sven Teichmann s.teichm...@s4ip.de wrote: Hi, I also found this response very useful and right now I am playing around with DocValues. If the default DocValuesFormat isn't fast enough, you can always switch to e.g. DirectDocValuesFormat (uses lots of RAM but

search time number of segments

2014-05-16 Thread De Simone, Alessandro
Hello everyone, We have had a performance issue ever since we stopped optimizing the index. We are using Lucene 4.8 (32-bit JVM for searching, 64-bit for indexing) on Windows 2008R2. Now we are letting Lucene handle the merges using the default merge policy (TieredMergePolicy). We have narrowed

AW: best choice for ramBufferSizeMB

2014-05-16 Thread Gudrun Siedersleben
Thanks for your answer. At the moment we use one single thread for indexing. Working with several threads is a possibility we should try. Testing with different values for ramBufferSizeMB between 16 MB and 256 MB showed that from 128 MB upward there was no improvement, as you already mentioned.

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread teko
Emanuel Buzek, Well, I first tried using ShingleFilter, and I thought it worked well, but in the end it still did not work the way I want. So I tried using NGram... I created a new analyzer to use it and ran a test... Well, it works, but I still need to do some manual validation to

Re: Can RAMDirectory work for gigabyte data which needs refreshing of the index all the time?

2014-05-16 Thread Toke Eskildsen
On Wed, 2014-05-07 at 15:46 +0200, Cheng wrote: I have an index of multiple gigabytes which serves 5-10 threads and needs refreshing very often. I wonder if RAMDirectory is a good candidate for this purpose. If not, what kind of directory is better? RAMDirectory will probably give you poor

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread teko
Wow man!! Forget what I said before!! I tried your method... well, generating the index is still a bit slower (1/2 minutes more), but querying... man, it works very well, and fast, very fast!! Really, here it is so fast that what creates the bottleneck is the write

Re: writer.updateDocument() not working (possible bug?)

2014-05-16 Thread Michael McCandless
You can retrieve the raw content for each field (assuming you stored it). But then you must re-generate a Document from the raw content yourself, as you did originally. I.e., you cannot rely on Lucene to remember schema-like things like boost, the FieldType (how the postings were indexed, whether

Re: How to add machine learning to Apache lucene

2014-05-16 Thread Diego Fernandez
I've actually been wondering about this as well. More specifically, I've been wondering if there's any kind of framework to integrate some sort of learning-to-rank approach (http://en.wikipedia.org/wiki/Learning_to_rank) into Lucene/Solr. Although a similar result can be accomplished by using