Re: IndexWriter flush/commit exception
Thanks Mike for a great explanation on flush IOExceptions.

I was thinking from the perspective of an HDFSDirectory. In addition to all the causes of IOException during flush you have listed, an HDFSDirectory also has to deal with network issues, which is not Lucene's problem at all. But I would ideally like to handle momentary network blips, as these are fully recoverable errors.

Will NRTCachingDirectory help in the case of HDFSDirectory? If all goes well, I should always flush to RAM, and the sync to HDFS happens only during commits. In that case, I can have retry logic inside the sync() method to handle momentary IOExceptions.

--
Ravi

On Tue, Dec 17, 2013 at 9:14 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Mon, Dec 16, 2013 at 7:33 AM, Ravikumar Govindarajan
> <ravikumar.govindara...@gmail.com> wrote:
>
>> I am trying to model a transaction log for Lucene, which creates a
>> transaction log per commit. Things work fine during normal operations,
>> but I cannot fathom the effect of:
>>
>> a. IOException during index commit. Will the index be restored to the
>> previous commit point? Can I blindly retry operations from the current
>> transaction log after some time interval?
>
> Yes: if an IOException is thrown from IndexWriter.commit, then the
> commit failed and the index still shows the previous successful commit.
>
>> b. IOException during background flush. Will all the RAM buffers,
>> including deletes, for that DWPT be cleaned up? flush() being
>> per-thread and async obviously has problems with my
>> transaction-log-per-commit approach, right? Most of the time, the
>> IOExceptions are temporary and recoverable (e.g. Solr's
>> HDFSDirectory), so I must definitely retry these operations after
>> some time interval.
>
> IOExceptions during flush are trickier. Often they will mean all
> documents assigned to that segment are lost, but not necessarily (e.g.
> if the IOE happened while creating a compound file). IOExceptions
> during add/updateDocument are also possible (e.g. we write stored
> fields and term vectors per-doc), which can result in losing all
> documents in that one segment as well (an aborting exception), but
> e.g. an IOE thrown by the analyzer will just result in that one
> document being lost (a non-aborting exception).
>
> Since you cannot know which case it was, it's probably safest to
> define a primary-key field and always use IW.updateDocument. This way,
> if the document was in fact not lost and you re-index it, you just
> replace it instead of creating a duplicate.
>
> Mike McCandless
> http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
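Mike's point about safe retries (re-index via IW.updateDocument keyed on a primary key, so a retried document replaces rather than duplicates itself) pairs naturally with a simple retry loop for the transient IOExceptions discussed in this thread. The helper below is an illustrative sketch, not Lucene API; the class name, attempt count, and backoff are all hypothetical:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries an I/O action a few times with linear
// backoff, so a momentary network blip does not surface as a hard failure.
final class Retry {
    static <T> T withRetries(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {  // assumed transient; retry after a pause
                last = e;
                Thread.sleep(backoffMillis * attempt);
            }
        }
        throw last;  // all attempts failed: propagate the last error
    }
}
```

With Lucene, the wrapped action would be something like `writer.updateDocument(new Term("id", id), doc)`, so that a document re-indexed after an ambiguous flush failure replaces itself instead of creating a duplicate.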
Re: Lucene Hierarchial Taxonomy Search
Hi, thanks for the answer. This could be a solution, but I have more than one hierarchical field to query, and I want to use the CategoryPath indexed in the taxonomy. I'm using a drill-down query:

DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2", '/'));

This way the search returns the books that match the two category paths and their descendants. To also retrieve the ancestors, I can start the drill-down from an ancestor of the requested CategoryPath (retrieved from the taxonomy).

The problem is that all the results get the same score. I want to override the similarity/score function in order to calculate a score based on CategoryPath length, comparing the query CategoryPath with each returned document's CategoryPath (book_category). E.g.:

if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8;
}

and so on.
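The depth-based scoring described above can be factored into a small helper. This is a hedged sketch: plain '/'-delimited strings stand in for Lucene's CategoryPath, and the 0.1-per-level penalty simply mirrors the example figures in the message:

```java
// Hypothetical scorer: rewards documents whose category path is close in
// depth to the query path; unrelated paths score zero. Strings stand in
// for Lucene's CategoryPath here.
final class PathScorer {
    static double score(String queryPath, String docPath) {
        boolean related = docPath.equals(queryPath)
                || docPath.startsWith(queryPath + "/")
                || queryPath.startsWith(docPath + "/");
        if (!related) {
            return 0.0;  // different branch of the taxonomy: no match
        }
        int queryDepth = queryPath.split("/").length;
        int docDepth = docPath.split("/").length;
        int distance = Math.abs(docDepth - queryDepth);
        // exact match scores 1.0; each level of separation costs 0.1
        return Math.max(0.0, 1.0 - 0.1 * distance);
    }
}
```

To apply such a function in Lucene 4.x you would typically re-score the drill-down hits (for example by wrapping the query in a custom-scoring query), though the exact wiring depends on the Lucene version in use.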
Re: IndexWriter flush/commit exception
On Wed, Dec 18, 2013 at 3:15 AM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:

> Thanks Mike for a great explanation on flush IOExceptions.

You're welcome!

> I was thinking from the perspective of an HDFSDirectory. In addition
> to all the causes of IOException during flush you have listed, an
> HDFSDirectory also has to deal with network issues, which is not
> Lucene's problem at all. But I would ideally like to handle momentary
> network blips, as these are fully recoverable errors. Will
> NRTCachingDirectory help in the case of HDFSDirectory? If all goes
> well, I should always flush to RAM, and the sync to HDFS happens only
> during commits. In that case, I can have retry logic inside the
> sync() method to handle momentary IOExceptions.

I'm not sure it helps, because on merge, if the expected size of the merged segment is large enough, NRTCachingDir won't cache those files: it just delegates directly to the wrapped directory. Likewise, if too much RAM is already in use, flushing a new segment would go straight to the wrapped directory.

You could make a custom Dir wrapper that always caches in RAM, but that sounds a bit terrifying :)

Alternatively, maybe on an HDFS error you could block that one thread while you retry for some amount of time, until the write/read succeeds? (Like an NFS hard mount.)

Mike McCandless
http://blog.mikemccandless.com
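The bypass behavior Mike describes can be illustrated with a small predicate. This is only a sketch of the policy, loosely modelled on NRTCachingDirectory's maxMergeSizeMB/maxCachedMB constructor arguments; the class and method names below are illustrative, not Lucene's actual implementation:

```java
// Illustrative sketch of the caching decision Mike describes: a merge
// expected to be large, or a cache that is already full, goes straight
// to the wrapped (e.g. HDFS) directory instead of RAM.
final class CachePolicy {
    final long maxMergeSizeBytes;
    final long maxCachedBytes;

    CachePolicy(long maxMergeSizeBytes, long maxCachedBytes) {
        this.maxMergeSizeBytes = maxMergeSizeBytes;
        this.maxCachedBytes = maxCachedBytes;
    }

    boolean cacheInRam(boolean isMerge, long expectedSizeBytes, long bytesAlreadyCached) {
        if (isMerge && expectedSizeBytes > maxMergeSizeBytes) {
            return false;  // large merge: delegate to the wrapped directory
        }
        // a flush that would overflow the RAM budget also bypasses the cache
        return bytesAlreadyCached + expectedSizeBytes <= maxCachedBytes;
    }
}
```

The real class is constructed as e.g. `new NRTCachingDirectory(hdfsDir, 5.0, 60.0)` (delegate, maxMergeSizeMB, maxCachedMB); so even with generous settings, big merges and overflowing flushes still hit the slow directory, which is why it does not shield you from HDFS blips.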
RE: Running Lucene tests on a custom Directory subclass
Never mind... the problem was that I compiled my jar against Lucene 3.3, but tried running against Lucene 4.4. It works when I also run against 3.3. (Or, at least, I get test failures that make sense!)

Scott

-----Original Message-----
From: Scott Schneider [mailto:scott_schnei...@symantec.com]
Sent: Tuesday, December 17, 2013 5:28 PM
To: java-user@lucene.apache.org
Subject: Running Lucene tests on a custom Directory subclass

Hello,

I'm trying to run Lucene's unit tests on Lucene Transform's TransformedDirectory. I get an AbstractMethodError on createOutput(), but I'm quite sure that method is defined. Here are a few lines from the error:

test:
...
[junit4] Suite: org.apache.lucene.index.TestStressAdvance
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestStressAdvance -Dtests.method=testStressAdvance -Dtests.seed=7285A7FB616F8E90 -Dtests.slow=true -Dtests.directory=org.apache.lucene.store.transform.TransformedDirectoryLuceneTestWrapper -Dtests.locale=sr_BA -Dtests.timezone=Eire -Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.44s J3 | TestStressAdvance.testStressAdvance
[junit4] Throwable #1: java.lang.AbstractMethodError: org.apache.lucene.store.Directory.createOutput(Ljava/lang/String;Lorg/apache/lucene/store/IOContext;)Lorg/apache/lucene/store/IndexOutput;
[junit4] at __randomizedtesting.SeedInfo.seed([7285A7FB616F8E90:E105EDB90E7666C2]:0)
[junit4] at org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:495)
[junit4] at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:62)
...
[junit4] at org.apache.lucene.index.TestStressAdvance.testStressAdvance(TestStressAdvance.java:56)
[junit4] at java.lang.Thread.run(Thread.java:722)
[junit4] 2> NOTE: test params are: codec=Lucene40, sim=DefaultSimilarity, locale=sr_BA, timezone=Eire
[junit4] 2> NOTE: Windows 7 6.1 x86/Oracle Corporation 1.7.0_03 (32-bit)/cpus=8,threads=1,free=5157752,total=16252928
[junit4] 2> NOTE: All tests run in this JVM: [TestMathUtil, TestStressAdvance]
[junit4] Completed on J3 in 0.47s, 1 test, 1 error

FAILURES!

To run the tests, I use ant test -Dtests.directory=org.blahblah.TransformedDirectoryLuceneTestWrapper -lib blahblah. I created the TransformedDirectoryLuceneTestWrapper class (as a subclass of TransformedDirectory) and gave it a 0-argument constructor.

Apologies if this was addressed elsewhere. In googling for an answer, the term "Directory" is basically invisible. I found a page on running Lucene's tests on a custom codec and approximated those steps.

Scott
Index Size in bytes
How can I get the size of the whole index in bytes?

regards
-Siraj
(212) 306-0154

This electronic mail message and any attachments may contain information which is privileged, sensitive and/or otherwise exempt from disclosure under applicable law. The information is intended only for the use of the individual or entity named as the addressee above. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution (electronic or otherwise) or forwarding of, or the taking of any action in reliance on, the contents of this transmission is strictly prohibited. If you have received this electronic transmission in error, please notify us by telephone, facsimile, or e-mail as noted above to arrange for the return of any electronic mail or attachments. Thank You.
Re: Index Size in bytes
Use Directory.listAll to get all files, then visit each one and call Directory.fileLength, and sum those up?

Note that this gives you the total size across all commit points, which may over-count on Windows in cases where IndexWriter has removed old commit points but IndexReaders still have the files open.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Dec 18, 2013 at 5:16 PM, Siraj Haider <si...@jobdiva.com> wrote:

> How can I get the size of the whole index in bytes?
>
> regards
> -Siraj
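Mike's suggestion is a short loop. The sketch below sums file sizes on the filesystem (the equivalent for an FSDirectory-backed index); with Lucene's Directory abstraction the same loop is shown in the comment. Class and method names here are illustrative:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sums the on-disk size of an index directory. With Lucene's Directory
// abstraction the analogous loop is:
//   long total = 0;
//   for (String name : dir.listAll()) total += dir.fileLength(name);
final class IndexSize {
    static long totalBytes(Path indexDir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    total += Files.size(file);  // counts every commit point still on disk
                }
            }
        }
        return total;
    }
}
```

As Mike notes, this counts all commit points still present, so it can over-count relative to the "live" commit if old files have not yet been deleted.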
Debugging unit tests with Eclipse
I'm trying to run and debug the unit tests for Lucene 3.3.0 using Eclipse. I loaded src/java, src/test, and src/test-framework into 3 projects in my workspace and got it all compiling. I created a debug configuration for the tests, but I get 54 unit test failures. I can copy the list if anyone wants. Most, but not all, have to do with CJK and other languages.

Also, I'm trying to test using a directory class. In the debug configuration window, I set -Dtests.directory=org.apache.lucene.store.transform.TransformedDirectoryLuceneTestWrapper as an argument. ant test -Dblahblah works, but the Eclipse run doesn't use that argument and gets the same 54 test failures, like normal.

Please help!

Thanks,
Scott
Re: IndexWriter flush/commit exception
> You could make a custom Dir wrapper that always caches in RAM, but
> that sounds a bit terrifying :)

This was exactly what I implemented :) A commit thread runs periodically every 30 seconds, while a RAM-monitor thread runs every 5 seconds and commits data in case sizeInBytes >= 70% of maxCachedBytes. This is quite dangerous, as you have said, especially since sync() can take an arbitrary amount of time.

> Alternatively, maybe on an HDFS error you could block that one thread
> while you retry for some amount of time, until the write/read
> succeeds? (Like an NFS hard mount.)

Well, after your idea I started digging into HDFS for this problem. I believe HDFS handles this internally without a snitch, as per this link:

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-3/data-flow

I believe that in the case of a node failure while writing, not even an IOException is thrown to the client; all of it is handled internally. I think I can rest easy on this. Maybe I will write a test case to verify this behavior.

Sorry for the trouble. Should have done some digging beforehand.

--
Ravi

On Wed, Dec 18, 2013 at 11:55 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Wed, Dec 18, 2013 at 3:15 AM, Ravikumar Govindarajan
> <ravikumar.govindara...@gmail.com> wrote:
>
>> Thanks Mike for a great explanation on flush IOExceptions.
>
> You're welcome!
>
>> I was thinking from the perspective of an HDFSDirectory. In addition
>> to all the causes of IOException during flush you have listed, an
>> HDFSDirectory also has to deal with network issues, which is not
>> Lucene's problem at all. But I would ideally like to handle momentary
>> network blips, as these are fully recoverable errors. Will
>> NRTCachingDirectory help in the case of HDFSDirectory? If all goes
>> well, I should always flush to RAM, and the sync to HDFS happens only
>> during commits. In that case, I can have retry logic inside the
>> sync() method to handle momentary IOExceptions.
>
> I'm not sure it helps, because on merge, if the expected size of the
> merged segment is large enough, NRTCachingDir won't cache those files:
> it just delegates directly to the wrapped directory. Likewise, if too
> much RAM is already in use, flushing a new segment would go straight
> to the wrapped directory.
>
> You could make a custom Dir wrapper that always caches in RAM, but
> that sounds a bit terrifying :)
>
> Alternatively, maybe on an HDFS error you could block that one thread
> while you retry for some amount of time, until the write/read
> succeeds? (Like an NFS hard mount.)
>
> Mike McCandless
> http://blog.mikemccandless.com
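The 70%-of-maxCachedBytes trigger Ravi describes for his RAM-monitor thread amounts to a simple predicate; the names and threshold below are illustrative, not from any actual implementation:

```java
// Illustrative sketch of the RAM-monitor check described above: a
// scheduled task polls this every few seconds and commits when the
// cached bytes reach 70% of the configured ceiling.
final class RamMonitor {
    private final long maxCachedBytes;

    RamMonitor(long maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    boolean shouldCommit(long sizeInBytes) {
        return sizeInBytes >= (long) (0.7 * maxCachedBytes);
    }
}
```

The danger Ravi notes is real: the commit triggered by this check calls sync(), which can block for an arbitrary time, during which the cache keeps filling.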
Phrase indexing and searching
Dear list,

My Lucene program can index single words and find the corpus documents that best match an input document (based on term frequencies). Now I want to index two-word phrases and find the matching corpus documents (based on phrase frequencies) for the input documents.

For example, the input document "blue house is very beautiful" would be split into phrases (say two-term phrases) like:

blue house
house very
very beautiful

etc. Is it possible to do this with Lucene? If so, how can I do it?

Thanks,
Manjula.
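The two-word phrases described are word bigrams, often called "shingles"; Lucene ships a ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter) for exactly this. The core idea can be sketched in plain Java (note the poster's example also appears to drop the word "is", which plain bigrams would keep; that would need a stopword filter first):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of two-word shingling: each adjacent pair of tokens
// becomes one phrase. In Lucene, ShingleFilter does this on a token stream.
final class Bigrams {
    static List<String> of(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            bigrams.add(tokens[i] + " " + tokens[i + 1]);  // adjacent word pair
        }
        return bigrams;
    }
}
```

In an analysis chain you would wrap your tokenizer with ShingleFilter (shingle size 2) so the index stores the phrases, and then query phrase frequencies like any other terms.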