[jira] Updated: (LUCENE-665) temporary file access denied on Windows
[ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
-------------------------------

    Attachment: FSWinDirectory_26_Sep_06.patch

Updated the patch according to the review comments by Hoss, plus:
- protect currMillis usage from system clock modifications.
- all Windows-specific code is in a single Java file with two inner classes, for "cleaner" javadocs (waitForRetry() is now private).

Tested as with the previous patch:
- "ant test" passes with the new code.
- For testing, modified build-common.xml to set a system property so that the new WinFS class was always in effect, and ran the tests - all passed.
- my stress test TestInterleavedAddAndRemoves fails in my environment by default and passes when FSWinDirectory is in effect.

> temporary file access denied on Windows
> ---------------------------------------
>
>                 Key: LUCENE-665
>                 URL: http://issues.apache.org/jira/browse/LUCENE-665
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: Doron Cohen
>         Attachments: FSDirectory_Retry_Logic.patch, FSDirs_Retry_Logic_3.patch, FSWinDirectory.patch, FSWinDirectory_26_Sep_06.patch, Test_Output.txt, TestInterleavedAddAndRemoves.java
>
>
> When interleaving adds and removes there is frequent opening/closing of readers and writers.
> I tried to measure performance in such a scenario (for issue 565), but the performance test failed - the indexing process crashed consistently with file "access denied" errors - "cannot create a lock file" in "lockFile.createNewFile()" and "cannot rename file".
> This is related to:
> - issue 516 (a closed issue: "TestFSDirectory fails on Windows") - http://issues.apache.org/jira/browse/LUCENE-516
> - user list questions due to file errors:
>   - http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
>   - http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
> - discussion on lock-less commits - http://www.nabble.com/Lock-less-commits-tf2126935.html
> My test setup is: XP (SP1), Java 1.5 - both SUN and IBM SDKs.
> I noticed that the problem is more frequent when locks are created on one disk and the index on another. Both are NTFS with the Windows indexing service enabled. I suspect this indexing service might be related - keeping files busy for a while - but don't know for sure.
> After experimenting with it I conclude that these problems - at least in my scenario - are due to a temporary situation: the FS, or the OS, is *temporarily* holding references to files or folders, preventing renaming them, deleting them, or creating new files in certain directories.
> So I added retry logic to FSDirectory for cases where the error was related to "Access Denied". This is the same approach brought up in http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html - there, in addition to the retry, gc() is invoked (I did not gc()). This is based on the *hope* that an access-denied situation would vanish after a small delay, and the retry would succeed.
> I modified FSDirectory this way for "Access Denied" errors during creating a new file and renaming a file.
> This worked fine for me. The performance test that failed before now managed to complete. There should be no performance implications due to this modification, because only the cases that would otherwise wrongly fail now delay some extra millis and retry.
> I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these changes to FSDirectory.
> All "ant test" tests pass with this patch.
> Also attaching a test case that demonstrates the problem - at least on my machine. There are two test cases in that test file - one that works in the system temp directory (like most Lucene tests) and one that creates the index on a different disk. The latter case can only run if the path ("D:", "tmp") is valid.
> It would be great if people that experienced these problems could try out this patch and comment whether it made any difference for them.
> If it turns out useful for others as well, including this patch in the code might help to relieve some of those "frustration" user cases.
> A comment on the state of the proposed patch:
> - It is not "ready to deploy" code - it has some debug printing, showing the cases where the "retry logic" actually took place.
> - I am not sure if the current 30ms is the right delay... why not 50ms? 10ms? This is currently defined by a constant.
> - Should a call to gc() be added? (I think not.)
> - Should the retry be attempted also on non-access-denied exceptions? (I think not.)
> - I feel it is somewhat "voodoo programming", and though I don't like it, it seems to work...
> Attached files:
> 1. TestInterleave
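The retry approach described in the issue can be sketched roughly as follows. This is a minimal illustration only, not the actual patch: the method name renameWithRetry and the retry count are placeholders, and the 30ms delay is the constant the author mentions questioning.

```java
import java.io.File;
import java.io.IOException;

public class RetryRename {
    // Placeholder limits; the patch defines the delay in a constant, as noted above.
    private static final int MAX_RETRIES = 10;
    private static final long RETRY_DELAY_MS = 30;

    /**
     * Rename 'from' to 'to', retrying on failure in the hope that a
     * transient access-denied condition clears after a short delay.
     */
    public static void renameWithRetry(File from, File to) throws IOException {
        for (int attempt = 0; ; attempt++) {
            if (from.renameTo(to)) {
                return; // rename succeeded
            }
            if (attempt >= MAX_RETRIES) {
                throw new IOException("could not rename " + from + " to " + to);
            }
            try {
                // Wait briefly for the FS/OS to release its hold on the file.
                Thread.sleep(RETRY_DELAY_MS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while retrying rename");
            }
        }
    }
}
```

As discussed in the thread, the cost only appears on the failure path: a rename that succeeds on the first attempt never sleeps.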
Re: highlight - scoring fragments with more of the same token
: TF is not a factor in fragment scores because I found it's typically more
: useful to look for fragments containing a strong mix of the query terms
: - not merely repetitions of the same term. The idea is the choice of
: scorer is pluggable if you don't like the default behaviour.

Taking a "coord" factor into consideration in that case may help balance out the benefits of tf weighting vs mixed terms. (maybe the default highlighting options already do that, I'm not sure ... just tossing it out as a comment from the peanut gallery)

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: highlight - scoring fragments with more of the same token
> I was somewhat surprised to find that highlighting scoring simply counts
> how many unique query terms appear in the fragment. Guess was expecting a

See the QueryScorer(Query query, IndexReader reader, String fieldName) constructor - this will factor IDF into the weighting for terms. Query boosts are automatically factored in too.

TF is not a factor in fragment scores because I found it's typically more useful to look for fragments containing a strong mix of the query terms - not merely repetitions of the same term. The idea is the choice of scorer is pluggable if you don't like the default behaviour.

The possibility of adding smarter fragmenting is also enabled by the Fragmenter interface - no "smarter" alternatives to the simple one have been implemented as yet though (as far as I am aware).

Cheers
Mark
Re: highlight - scoring fragments with more of the same token
markharw00d <[EMAIL PROTECTED]> wrote on 26/09/2006 00:11:12:

> If you were to score repeated terms then I suspect it would have to be
> done so that the repetitions didn't score as highly as the first
> occurrence - otherwise f2 could be selected as a better fragment than f3
> for the query q1 in your example.
> Repetitions of a term in a fragment could be scored as a very small
> fraction of the score given to the first occurrence. This would at least
> rank f2 higher than f1 for query q2.
> Another potentially useful ranking factor may be to boost fragments
> found at the beginning of a document - that's where people tend to write
> summaries or introductions.

Yes, it makes sense to add these heuristics.

I was somewhat surprised to find that highlighting scoring simply counts how many unique query terms appear in the fragment. I guess I was expecting a more similarity-like ranking of fragments - something that would perhaps have tf related to the frequency of a term in a fragment, and idf related to the frequency of the term in the entire text. Idf would be meaningless for a single-term query. Possibly, idf could relate to "iff" ~ the inverse number of fragments containing the term. I am not sure if this is worth the effort, but it seems more correct...?

Another thing I saw is that Highlighter seems to break the text arbitrarily by max-fragment-size, so for the text:
  1 2 x 4 a b x d y B C D
if it happens to be broken into 4-token fragments, for the query "x y" the result would be:
  1 2 x 4 - score 1
  a b x d - score 1
  y B C D - score 1
and the first fragment would be selected as 'best', although the fragment "x d y B" that appears in that text is better. Again, not sure if this is worth the effort - having overlap between candidate fragments - just something to think about.
> Doron Cohen wrote:
> > This question was raised in the user's list -
> > http://www.nabble.com/highlighting-tf2322109.html
> >
> > Assume three fragments and two queries:
> > f1 = aa 11 bb 33 cc
> > f2 = aa 11 bb 11 cc
> > f3 = aa 11 bb 22 cc
> > q1 = 11 22
> > q2 = 11
> > Now we call highlighter.getBestFragment(q);
> > For q1, f3 is returned, as expected.
> > For q2, f1 is returned, although "11" appears twice in f2 but only once in f1.
> >
> > This is because QueryScorer.getTokenScore(Token) counts only unique fragment tokens.
> >
> > Would it make sense to make this behavior controllable?
> > (It is easily done but I am not sure about the consequences.)
> >
> > Or perhaps there is a way to achieve this behavior (preferring f2 over f1 for q2 above) that I missed?
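The fixed-boundary problem raised in this message can be demonstrated with a small sketch in plain Java (no Lucene involved; class and method names are illustrative only). For the query {x, y} on the example text, every non-overlapping 4-token fragment scores 1 under unique-term counting, while a sliding window over the same tokens finds the overlapping fragment "x d y B" scoring 2.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FragmentWindows {
    /** Count unique query terms in a token window (the default QueryScorer-style count). */
    static int score(List<String> tokens, int start, int size, Set<String> query) {
        Set<String> seen = new HashSet<>();
        for (int i = start; i < Math.min(start + size, tokens.size()); i++) {
            if (query.contains(tokens.get(i))) {
                seen.add(tokens.get(i));
            }
        }
        return seen.size();
    }

    /** Best score among non-overlapping fixed-size fragments (Highlighter-style breaking). */
    static int bestFixed(List<String> tokens, int size, Set<String> query) {
        int best = 0;
        for (int start = 0; start < tokens.size(); start += size) {
            best = Math.max(best, score(tokens, start, size, query));
        }
        return best;
    }

    /** Best score among all overlapping (sliding) windows of the same size. */
    static int bestSliding(List<String> tokens, int size, Set<String> query) {
        int best = 0;
        for (int start = 0; start + size <= tokens.size(); start++) {
            best = Math.max(best, score(tokens, start, size, query));
        }
        return best;
    }
}
```

On the tokens "1 2 x 4 a b x d y B C D" with window size 4, bestFixed returns 1 and bestSliding returns 2, matching the example in the message.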
[jira] Updated: (LUCENE-676) Promote solr's PrefixFilter into Java Lucene's core
[ http://issues.apache.org/jira/browse/LUCENE-676?page=all ]

Andi Vajda updated LUCENE-676:
------------------------------

    Attachment: TestPrefixFilter.java

Here is another attachment by Yura providing the requested unit test.

> Promote solr's PrefixFilter into Java Lucene's core
> ---------------------------------------------------
>
>                 Key: LUCENE-676
>                 URL: http://issues.apache.org/jira/browse/LUCENE-676
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Andi Vajda
>            Priority: Trivial
>         Attachments: PrefixFilter.java, TestPrefixFilter.java
>
>
> Solr's PrefixFilter class is not specific to Solr and seems to be of interest to core lucene users (PyLucene in this case).
> Promoting it into the Lucene core would be helpful.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
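The idea behind a prefix filter - marking, in a bit set, every document whose indexed term starts with a given prefix - can be sketched in plain Java. This is an illustration of the concept only, not the Solr/Lucene source; the class name PrefixBits and the list-of-terms representation are stand-ins for the real term enumeration.

```java
import java.util.BitSet;
import java.util.List;

public class PrefixBits {
    /**
     * Return a bit set with bit i on iff terms.get(i) starts with the prefix,
     * mimicking how a prefix filter marks matching documents.
     */
    public static BitSet matching(List<String> terms, String prefix) {
        BitSet bits = new BitSet(terms.size());
        for (int i = 0; i < terms.size(); i++) {
            if (terms.get(i).startsWith(prefix)) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```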
[jira] Commented: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM
[ http://issues.apache.org/jira/browse/LUCENE-636?page=comments#action_12437789 ]

Johan Stuyts commented on LUCENE-636:
-------------------------------------

I just found out that the patch is incomplete. You can only specify the subclass of the SegmentReader class, but not the subclass of the MultiReader class. If your index contains multiple segments, a MultiReader instead of the specified subclass of SegmentReader is created, and it is not possible to cast the returned IndexReader to the subclass of SegmentReader you specified in the LuceneConfig object.

> [PATCH] Differently configured Lucene 'instances' in same JVM
> -------------------------------------------------------------
>
>                 Key: LUCENE-636
>                 URL: http://issues.apache.org/jira/browse/LUCENE-636
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Johan Stuyts
>         Attachments: Lucene2DifferentConfigurations.patch
>
>
> Currently Lucene can be configured using system properties. When running multiple 'instances' of Lucene for different purposes in the same JVM, it is not possible to use different settings for each 'instance'.
> I made changes to some Lucene classes so you can pass a configuration to that class. The Lucene 'instance' will use the settings from that configuration. The changes do not affect the API and/or the current behavior, so they are backwards compatible.
> In addition to the changes above I also made SegmentReader and SegmentTermDocs extensible outside of their package. I would appreciate the inclusion of these changes but don't mind creating a separate issue for them.
[jira] Commented: (LUCENE-676) Promote solr's PrefixFilter into Java Lucene's core
[ http://issues.apache.org/jira/browse/LUCENE-676?page=comments#action_12437757 ]

Hoss Man commented on LUCENE-676:
---------------------------------

Even though I use PrefixFilter on a daily basis in Solr, and I am confident of its correctness, I don't think anything should be committed/promoted to the Lucene code base without some unit tests. (PrefixFilter is exercised by a few tests in the Solr code base at the moment, but they aren't portable because they go through the SolrCore.)

> Promote solr's PrefixFilter into Java Lucene's core
> ---------------------------------------------------
>
>                 Key: LUCENE-676
>                 URL: http://issues.apache.org/jira/browse/LUCENE-676
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Andi Vajda
>            Priority: Trivial
>         Attachments: PrefixFilter.java
>
>
> Solr's PrefixFilter class is not specific to Solr and seems to be of interest to core lucene users (PyLucene in this case).
> Promoting it into the Lucene core would be helpful.
Re: highlight - scoring fragments with more of the same token
If you were to score repeated terms then I suspect it would have to be done so that the repetitions didn't score as highly as the first occurrence - otherwise f2 could be selected as a better fragment than f3 for the query q1 in your example.

Repetitions of a term in a fragment could be scored as a very small fraction of the score given to the first occurrence. This would at least rank f2 higher than f1 for query q2.

Another potentially useful ranking factor may be to boost fragments found at the beginning of a document - that's where people tend to write summaries or introductions.

Doron Cohen wrote:
> This question was raised in the user's list -
> http://www.nabble.com/highlighting-tf2322109.html
>
> Assume three fragments and two queries:
> f1 = aa 11 bb 33 cc
> f2 = aa 11 bb 11 cc
> f3 = aa 11 bb 22 cc
> q1 = 11 22
> q2 = 11
> Now we call highlighter.getBestFragment(q);
> For q1, f3 is returned, as expected.
> For q2, f1 is returned, although "11" appears twice in f2 but only once in f1.
>
> This is because QueryScorer.getTokenScore(Token) counts only unique fragment tokens.
>
> Would it make sense to make this behavior controllable?
> (It is easily done but I am not sure about the consequences.)
>
> Or perhaps there is a way to achieve this behavior (preferring f2 over f1 for q2 above) that I missed?
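The fractional-repetition idea above can be sketched in plain Java (illustrative only; the 0.1 discount is an arbitrary placeholder, and the class is not part of the Highlighter API). The first occurrence of each query term scores 1.0 and each repetition adds only a small fraction, so f2 outranks f1 for q2 while f3 still outranks f2 for q1, exactly the ordering argued for above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DiscountedFragmentScorer {
    // Arbitrary placeholder: each repetition is worth a small fraction of the first hit.
    static final double REPEAT_DISCOUNT = 0.1;

    /** First occurrence of each query term scores 1.0; repetitions add REPEAT_DISCOUNT. */
    static double score(List<String> fragmentTokens, Set<String> queryTerms) {
        Map<String, Integer> counts = new HashMap<>();
        double score = 0.0;
        for (String token : fragmentTokens) {
            if (!queryTerms.contains(token)) {
                continue; // ignore tokens that are not query terms
            }
            int occurrences = counts.merge(token, 1, Integer::sum);
            score += (occurrences == 1) ? 1.0 : REPEAT_DISCOUNT;
        }
        return score;
    }
}
```

With the fragments from the quoted message: for q2 = {11}, f2 scores 1.1 vs. 1.0 for f1; for q1 = {11, 22}, f3 scores 2.0 vs. 1.1 for f2.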