Re: IndexWriter flush/commit exception

2013-12-18 Thread Ravikumar Govindarajan
Thanks Mike for a great explanation on Flush IOException

I was thinking from the perspective of an HDFSDirectory. In addition to
all the causes of IOException during flush that you have listed, an
HDFSDirectory also has to deal with network issues, which are not Lucene's
problem at all.

But I would ideally like to handle momentary network blips, as these are
fully recoverable errors.


Will NRTCachingDirectory help in the case of an HDFSDirectory? If all goes
well, I should always flush to RAM, and the sync to HDFS happens only during
commits. In that case, I can have retry logic inside the sync() method to
handle momentary IOExceptions.
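
For illustration, a minimal sketch of that retry idea inside a Directory
wrapper's sync() (the wrapped directory "in", the retry count, and the
back-off are assumptions, not an existing Lucene API; the usual java.io
and java.util imports are assumed):

// Sketch: retry transient HDFS failures during sync(); "in" is assumed
// to be the wrapped HDFS-backed Directory.
@Override
public void sync(Collection<String> names) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < 5; attempt++) {  // retry count is arbitrary
        try {
            in.sync(names);
            return;                                  // sync succeeded
        } catch (IOException e) {
            last = e;                                // maybe a momentary blip
            try {
                Thread.sleep(1000L << attempt);      // exponential back-off
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw last;
            }
        }
    }
    throw last;  // still failing after retries: surface the error
}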


--
Ravi


On Tue, Dec 17, 2013 at 9:14 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Mon, Dec 16, 2013 at 7:33 AM, Ravikumar Govindarajan
 ravikumar.govindara...@gmail.com wrote:
  I am trying to model a transaction-log for lucene, which creates a
  transaction-log per-commit
 
  Things work fine during normal operations, but I cannot work out what
  happens during:
 
  a. IOException during Index-Commit
 
  Will the index be restored to previous commit-point? Can I blindly re-try
  operations from the current transaction log, after some time interval?

 Yes: if an IOException is thrown from IndexWriter.commit then the
 commit failed and the index still shows the previous successful
 commit.
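
As a sketch, that guarantee makes a retry loop around commit safe (the
TransactionLog type and its methods below are hypothetical, standing in
for the transaction-log-per-commit described above):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

// Sketch: a failed commit leaves the previous commit point intact, so it
// is safe to wait and retry; TransactionLog is a hypothetical handle.
void commitWithRetry(IndexWriter writer, TransactionLog txLog)
        throws InterruptedException {
    while (true) {
        try {
            writer.commit();
            txLog.startNewGeneration();  // hypothetical: commit succeeded
            return;
        } catch (IOException e) {
            // Index still shows the last successful commit; retry later.
            Thread.sleep(5000);
        }
    }
}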

  b. IOException during Background-Flush
 
  Will all the RAM buffers including deletes for that DWPT be cleaned up?
  flush() being per-thread and async obviously has problems with my
  transaction-log-per-commit approach, right?
 
  Most of the time, the IOExceptions are temporary and recoverable (e.g.
  Solr's HDFSDirectory). So, I must definitely retry these operations
  after some time interval.

 IOExceptions during flush are trickier.  Often it will mean all
 documents assigned to that segment are lost, but not necessarily (e.g.
 if the IOE happened while creating a compound file).

 IOExceptions during add/updateDocument are also possible (e.g. we
 write stored fields and term vectors per-doc), which can result in losing
 all documents in that one segment as well (an aborting exception), while
 an IOE thrown by the analyzer, for example, will just result in that one
 document being lost (a non-aborting exception).

 Since you cannot know which case it was, it's probably safest to
 define a primary key field, and always use IW.updateDocument.  This
 way if the document was in fact not lost, and you re-index it, you
 just replace it, instead of creating a duplicate.
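
A minimal sketch of that pattern (the "id" field name is just an example):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch: always index through updateDocument keyed on a primary-key
// field, so replaying a transaction log never creates duplicates.
void indexOrReplace(IndexWriter writer, String id, Document doc)
        throws IOException {
    doc.add(new StringField("id", id, Field.Store.YES));
    writer.updateDocument(new Term("id", id), doc);  // delete-then-add
}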

 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Lucene Hierarchial Taxonomy Search

2013-12-18 Thread Nino_87
Hi, thanks for the answer.
This could be a solution, but I have more than one hierarchical field to
query, and I want to use the CategoryPath indexed in the taxonomy.
I'm using a DrillDownQuery:

DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2", '/'));

In this way the search returns the books that match the two category paths
and their descendants.
To also retrieve the ancestors, I can start the drill-down from an ancestor
of the requested CategoryPath (retrieved from the taxonomy).

The problem is that all the results get the same score.
I want to override the similarity/score function in order to calculate a
CategoryPath-length-based score, comparing the query CategoryPath with each
returned document's CategoryPath (book_category).
 
E.g.:
if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8;
} and so on.
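
A minimal sketch of such a path-distance score, done over plain strings
rather than the facet API (the class, method, and weighting here are
illustrative assumptions, not Lucene code):

// Score 1.0 for an exact category match, minus 0.1 per level of
// ancestor/descendant distance; 0 if the paths diverge entirely.
public final class CategoryPathScorer {
    public static float pathScore(String queryPath, String docPath) {
        String[] q = queryPath.split("/");
        String[] d = docPath.split("/");
        int shared = 0;
        while (shared < q.length && shared < d.length
                && q[shared].equals(d[shared])) {
            shared++;
        }
        if (shared < Math.min(q.length, d.length)) {
            return 0f;  // paths diverge: no boost
        }
        int distance = Math.abs(q.length - d.length);
        return Math.max(0f, 1.0f - 0.1f * distance);  // 1.0, 0.9, 0.8, ...
    }
}

Wiring this into scoring would still require a custom query or collector
that looks up each hit's CategoryPath from the taxonomy.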





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-Hierarchial-Taxonomy-Search-tp4107100p4107226.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter flush/commit exception

2013-12-18 Thread Michael McCandless
On Wed, Dec 18, 2013 at 3:15 AM, Ravikumar Govindarajan
ravikumar.govindara...@gmail.com wrote:
 Thanks Mike for a great explanation on Flush IOException

You're welcome!

 I was thinking from the perspective of an HDFSDirectory. In addition to
 all the causes of IOException during flush that you have listed, an
 HDFSDirectory also has to deal with network issues, which are not Lucene's
 problem at all.

 But I would ideally like to handle momentary network blips, as these are
 fully recoverable errors.


 Will NRTCachingDirectory help in the case of an HDFSDirectory? If all goes
 well, I should always flush to RAM, and the sync to HDFS happens only
 during commits. In that case, I can have retry logic inside the sync()
 method to handle momentary IOExceptions.

I'm not sure it helps, because on merge, if the expected size of the
merge segment is large enough, NRTCachingDir won't cache those files:
it just delegates directly to the wrapped directory.

Likewise, if too much RAM is already in use, flushing a new segment
would go straight to the wrapped directory.
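
For reference, a sketch of how those two thresholds are set when wrapping a
directory (hdfsDir stands in for your HDFSDirectory instance; 5.0 and 60.0
are just example values):

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NRTCachingDirectory;

// Segments whose expected size exceeds maxMergeSizeMB (5 MB here), or
// flushes that would push the cache past maxCachedMB (60 MB here),
// bypass the RAM cache and go straight to hdfsDir.
Directory dir = new NRTCachingDirectory(hdfsDir, 5.0, 60.0);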

You could make a custom Dir wrapper that always caches in RAM, but
that sounds a bit terrifying :)

Alternatively, maybe on an HDFS error you could block that one thread
while you retry for some amount of time, until the write/read
succeeds?  (Like an NFS hard mount).

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Running Lucene tests on a custom Directory subclass

2013-12-18 Thread Scott Schneider
Never mind... the problem was that I compiled my jar against Lucene 3.3, but 
tried running against Lucene 4.4.  It works when I also run against 3.3.  (Or, 
at least, I get test failures that make sense!)

Scott


 -Original Message-
 From: Scott Schneider [mailto:scott_schnei...@symantec.com]
 Sent: Tuesday, December 17, 2013 5:28 PM
 To: java-user@lucene.apache.org
 Subject: Running Lucene tests on a custom Directory subclass
 
 Hello,
 
 I'm trying to run Lucene's unit tests on Lucene Transform's
 TransformedDirectory.  I get an AbstractMethodError on createOutput(),
 but I'm quite sure that method is defined.  Here are a few lines from
 the error:
 
 test:
 ...
 [junit4] Suite: org.apache.lucene.index.TestStressAdvance
 [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestStressAdvance -Dtests.method=testStressAdvance -Dtests.seed=7285A7FB616F8E90 -Dtests.slow=true -Dtests.directory=org.apache.lucene.store.transform.TransformedDirectoryLuceneTestWrapper -Dtests.locale=sr_BA -Dtests.timezone=Eire -Dtests.file.encoding=US-ASCII
 [junit4] ERROR   0.44s J3 | TestStressAdvance.testStressAdvance
 [junit4]    > Throwable #1: java.lang.AbstractMethodError: org.apache.lucene.store.Directory.createOutput(Ljava/lang/String;Lorg/apache/lucene/store/IOContext;)Lorg/apache/lucene/store/IndexOutput;
 [junit4]    >   at __randomizedtesting.SeedInfo.seed([7285A7FB616F8E90:E105EDB90E7666C2]:0)
 [junit4]    >   at org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:495)
 [junit4]    >   at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:62)
 ...
 [junit4]    >   at org.apache.lucene.index.TestStressAdvance.testStressAdvance(TestStressAdvance.java:56)
 [junit4]    >   at java.lang.Thread.run(Thread.java:722)
 [junit4]   2> NOTE: test params are: codec=Lucene40, sim=DefaultSimilarity, locale=sr_BA, timezone=Eire
 [junit4]   2> NOTE: Windows 7 6.1 x86/Oracle Corporation 1.7.0_03 (32-bit)/cpus=8,threads=1,free=5157752,total=16252928
 [junit4]   2> NOTE: All tests run in this JVM: [TestMathUtil, TestStressAdvance]
 [junit4] Completed on J3 in 0.47s, 1 test, 1 error <<< FAILURES!
 
 To run the tests, I use ant test -Dtests.directory=org.blahblah.TransformedDirectoryLuceneTestWrapper -lib blahblah.  I created the TransformedDirectoryLuceneTestWrapper class (as a subclass of TransformedDirectory) and gave it a 0-argument constructor.
 
 
 Apologies if this was addressed elsewhere.  In googling for an answer,
 the term Directory is basically invisible.  I found a page on running
 Lucene's tests on a custom codec and approximated those steps.
 
 Scott


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Index Size in bytes

2013-12-18 Thread Siraj Haider
How can I get the size of the whole index in bytes?

regards
-Siraj
(212) 306-0154





Re: Index Size in bytes

2013-12-18 Thread Michael McCandless
Use Directory.listAll to get all files, then visit each one and call
Directory.fileLength, and sum those up?

Note that this gives you total size of all commit points, which may be
over-counting on Windows in cases where IndexWriter has removed old
commit points but IndexReaders still have the files open.
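
A sketch of that summing loop:

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Sum the sizes of all files currently in the index directory.
static long indexSizeInBytes(Directory dir) throws IOException {
    long total = 0;
    for (String file : dir.listAll()) {
        total += dir.fileLength(file);  // one segment/commit file
    }
    return total;
}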

Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 18, 2013 at 5:16 PM, Siraj Haider si...@jobdiva.com wrote:
 How can I get the size of the whole index in bytes?

 regards
 -Siraj
 (212) 306-0154

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Debugging unit tests with Eclipse

2013-12-18 Thread Scott Schneider
I'm trying to run and debug the unit tests for Lucene 3.3.0 using Eclipse.  I 
loaded src/java, src/test, and src/test-framework into 3 projects in my 
workspace and got it all compiling.  I created a debug configuration for tests, 
but I get 54 unit test failures.  I can copy the list if anyone wants.  Most, 
but not all, have to do with CJK and other languages.

Also, I'm trying to test using a custom Directory class.  In the debug 
configurations window, I set 
-Dtests.directory=org.apache.lucene.store.transform.TransformedDirectoryLuceneTestWrapper
 as an argument.  ant test -Dblahblah works from the command line, but the 
Eclipse debug run doesn't use that argument and gets the same 54 test 
failures as a normal run.

Please help!

Thanks,
Scott



Re: IndexWriter flush/commit exception

2013-12-18 Thread Ravikumar Govindarajan
 You could make a custom Dir wrapper that always caches in RAM, but
 that sounds a bit terrifying :)

This was exactly what I implemented :) A commit-thread runs periodically
every 30 seconds, while a RAM-monitor thread runs every 5 seconds to commit
data in case sizeInBytes >= 70% of maxCachedBytes. This is quite dangerous,
as you have said, especially since sync() can take an arbitrary amount of time.
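
A rough sketch of that scheduling, with the commit call and the cache
accounting left as hypothetical helpers (commitNow, cachedBytes, and
maxCachedBytes are not real APIs):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

void startBackgroundTasks() {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    // Commit-thread: every 30 seconds.
    scheduler.scheduleWithFixedDelay(new Runnable() {
        public void run() { commitNow(); }
    }, 30, 30, TimeUnit.SECONDS);
    // RAM-monitor: every 5 seconds, commit early if the cache is 70% full.
    scheduler.scheduleWithFixedDelay(new Runnable() {
        public void run() {
            if (cachedBytes() >= 0.7 * maxCachedBytes()) {
                commitNow();
            }
        }
    }, 5, 5, TimeUnit.SECONDS);
}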

 Alternatively, maybe on an HDFS error you could block that one thread
 while you retry for some amount of time, until the write/read
 succeeds?  (Like an NFS hard mount).

Well, after your idea I started digging into HDFS for this problem. I believe
HDFS handles this internally without a hitch, as per this link:
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-3/data-flow

I believe that in the case of a node failure while writing, an IOException is
not even thrown to the client; all of it is handled internally. I think I can
rest easy on this.
Maybe I will write a test-case to verify this behavior.

Sorry for the trouble; I should have done some digging beforehand.

--
Ravi

On Wed, Dec 18, 2013 at 11:55 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Wed, Dec 18, 2013 at 3:15 AM, Ravikumar Govindarajan
 ravikumar.govindara...@gmail.com wrote:
  Thanks Mike for a great explanation on Flush IOException

 You're welcome!

  I was thinking from the perspective of an HDFSDirectory. In addition to
  all the causes of IOException during flush that you have listed, an
  HDFSDirectory also has to deal with network issues, which are not
  Lucene's problem at all.
 
  But I would ideally like to handle momentary network blips, as these are
  fully recoverable errors.
 
 
  Will NRTCachingDirectory help in the case of an HDFSDirectory? If all
  goes well, I should always flush to RAM, and the sync to HDFS happens
  only during commits. In that case, I can have retry logic inside the
  sync() method to handle momentary IOExceptions

 I'm not sure it helps, because on merge, if the expected size of the
 merge segment is large enough, NRTCachingDir won't cache those files:
 it just delegates directly to the wrapped directory.

 Likewise, if too much RAM is already in use, flushing a new segment
 would go straight to the wrapped directory.

 You could make a custom Dir wrapper that always caches in RAM, but
 that sounds a bit terrifying :)

 Alternatively, maybe on an HDFS error you could block that one thread
 while you retry for some amount of time, until the write/read
 succeeds?  (Like an NFS hard mount).

 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Phrase indexing and searching

2013-12-18 Thread Manjula Wijewickrema
Dear list,

My Lucene program is able to index single words and retrieve from a corpus
the documents that best match an input document (based on term frequencies).
Now I want to index two-word phrases and retrieve the matching corpus
documents (based on phrase frequencies) for an input document.

ex:-
input document:
blue house is very beautiful

split it into phrases (say, two-term phrases) like:
blue house
house very
very beautiful
etc.

 Is it possible to do this with Lucene? If so, how can I do it?

Thanks,

Manjula.
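
One way to produce such two-word phrases at index time is Lucene's shingle
support; a minimal sketch (the base analyzer and version constant are
assumptions, and stop-word handling will change which pairs come out):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Emit two-word shingles ("blue house", "house is", ...); by default
// the single words are still emitted alongside the shingles.
Analyzer base = new StandardAnalyzer(Version.LUCENE_44);
Analyzer shingles = new ShingleAnalyzerWrapper(base, 2, 2);

Pass the shingle analyzer to IndexWriterConfig when indexing, and use the
same analyzer when building queries, so the phrase frequencies line up.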