[jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-09-26 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-665?page=all ]

Doron Cohen updated LUCENE-665:
---

Attachment: FSWinDirectory_26_Sep_06.patch

Updated the patch according to review comments by Hoss, plus:
- protect currMillis usage from system clock modifications.
- all Win specific code in a single Java file with two inner classes, for 
"cleaner" javadocs (now waitForRetry() is private). 

Tested as previous patch: 
- "ant test" passes with new code. 
- For test, modified build-common.xml to set a system property so that the new 
WinFS class was always in effect and ran the tests - all passed. 
- my stress test TestInterleavedAddAndRemoves fails in my env by default and 
passes when FSWinDirectory is in effect. 
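
For readers who have not opened the patch, the retry idea from the issue description quoted below can be sketched roughly as follows. This is not the code from FSWinDirectory: the class name RetryingFileOps, the MAX_ATTEMPTS cap, and the check for "denied" in the exception message are all assumptions made for illustration. Only the 30 ms delay comes from the issue text. Bounding the loop by attempt count rather than by elapsed wall-clock time is one simple way to stay unaffected by system clock changes.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper for illustration; not the class from the attached patch. */
class RetryingFileOps {
    /** Delay between attempts; 30 ms is the value mentioned in the issue. */
    static final long RETRY_DELAY_MS = 30;
    /** Assumed cap on attempts; counting attempts avoids depending on the system clock. */
    static final int MAX_ATTEMPTS = 5;

    /** Assumption: an "access denied" failure can be recognized from the message text. */
    static boolean isAccessDenied(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.toLowerCase().contains("denied");
    }

    /** Creates the file, retrying only when the failure looks like "access denied". */
    static boolean createNewFileWithRetry(File file) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return file.createNewFile(); // false means the file already exists
            } catch (IOException e) {
                if (!isAccessDenied(e)) {
                    throw e; // other errors are not retried
                }
                last = e;
                pause();
            }
        }
        throw last; // every attempt was denied
    }

    /** renameTo() reports no reason for failure, so every failed attempt is retried. */
    static boolean renameWithRetry(File from, File to) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (from.renameTo(to)) {
                return true;
            }
            pause();
        }
        return false;
    }

    private static void pause() {
        try {
            Thread.sleep(RETRY_DELAY_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

In the common case the first attempt succeeds and no delay is paid, which matches the claim in the issue that only calls that would otherwise fail become slower.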


> temporary file access denied on Windows
> ---
>
> Key: LUCENE-665
> URL: http://issues.apache.org/jira/browse/LUCENE-665
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Store
>Affects Versions: 2.0.0
> Environment: Windows
>Reporter: Doron Cohen
> Attachments: FSDirectory_Retry_Logic.patch, 
> FSDirs_Retry_Logic_3.patch, FSWinDirectory.patch, 
> FSWinDirectory_26_Sep_06.patch, Test_Output.txt, 
> TestInterleavedAddAndRemoves.java
>
>
> When interleaving adds and removes there is frequent opening/closing of 
> readers and writers. 
> I tried to measure performance in such a scenario (for issue 565), but the 
> performance test failed  - the indexing process crashed consistently with 
> file "access denied" errors - "cannot create a lock file" in 
> "lockFile.createNewFile()" and "cannot rename file".
> This is related to:
> - issue 516 (a closed issue: "TestFSDirectory fails on Windows") - 
> http://issues.apache.org/jira/browse/LUCENE-516 
> - user list questions due to file errors:
>   - 
> http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
>   - 
> http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
> - discussion on lock-less commits 
> http://www.nabble.com/Lock-less-commits-tf2126935.html
> My test setup is: XP (SP1), JAVA 1.5 - both SUN and IBM SDKs. 
> I noticed that the problem is more frequent when locks are created on one 
> disk and the index on another. Both are NTFS with Windows indexing service 
> enabled. I suspect this indexing service might be related - keeping files 
> busy for a while, but don't know for sure.
> After experimenting with it I conclude that these problems - at least in my 
> scenario - are due to a temporary situation - the FS, or the OS, is 
> *temporarily* holding references to files or folders, which prevents 
> renaming them, deleting them, or creating new files in certain directories. 
> So I added retry logic to FSDirectory for cases where the error was related to 
> "Access Denied". This is the same approach suggested in 
> http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
>  - there, in addition to the retry, gc() is invoked (I did not gc()). This is 
> based on the *hope* that an access-denied situation would vanish after a small 
> delay, and the retry would succeed.
> I modified FSDirectory this way for "Access Denied" errors while creating a 
> new file or renaming a file.
> This worked fine for me. The performance test that failed before now managed 
> to complete. There should be no performance implications due to this 
> modification, because only the cases that would otherwise wrongly fail now 
> wait a few extra milliseconds and retry.
> I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these 
> changes to FSDirectory. 
> All "ant test" tests pass with this patch.
> Also attaching a test case that demonstrates the problem - at least on my 
> machine. There are two test cases in that test file - one that works in system 
> temp (like most Lucene tests) and one that creates the index on a different 
> disk. The latter case can only run if the path ("D:" , "tmp") is valid.
> It would be great if people that experienced these problems could try out 
> this patch and comment whether it made any difference for them. 
> If it turns out useful for others as well, including this patch in the code 
> might help to relieve some of those "frustration" user cases.
> A comment on state of proposed patch: 
> - It is not "ready to deploy" code - it has some debug printing, showing 
> the cases where the "retry logic" actually took place. 
> - I am not sure if current 30ms is the right delay... why not 50ms? 10ms? 
> This is currently defined by a constant.
> - Should a call to gc() be added? (I think not.)
> - Should the retry be attempted also on "non access-denied" exceptions? (I 
> think not).
> - I feel it is somewhat "voodoo programming", but though I don't like it, it 
> seems to work... 
> Attached files:
> 1. TestInterleave

Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread Chris Hostetter

: TF is not a factor in fragment scores because I found it's typically more
: useful to look for fragments containing a strong mix of the query terms
: - not merely repetitions of the same term. The idea is the choice of
: scorer is pluggable if you don't like the default behaviour.

Taking a "coord" factor into consideration in that case may help balance
out the benefits of tf weighting vs mixed terms.  (maybe the default
highlighting options already do that, i'm not sure ... just tossing it out
as a comment from the peanut gallery)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread markharw00d

> I was somewhat surprised to find that highlighting scoring simply counts
> how many unique query terms appear in the fragment. Guess was expecting a


See QueryScorer(Query query, IndexReader reader, String fieldName) constructor 
- this will factor IDF into weighting for terms. Query boosts are automatically 
factored in too.
TF is not a factor in fragment scores because I found it's typically more useful 
to look for fragments containing a strong mix of the query terms - not merely 
repetitions of the same term. The idea is the choice of scorer is pluggable if 
you don't like the default behaviour.

The possibility of adding smarter fragmenting is also enabled by the interface for 
Fragmenter - no "smarter" alternatives to the simple one have been implemented 
as yet though (as far as I am aware).

Cheers
Mark








Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread Doron Cohen

markharw00d <[EMAIL PROTECTED]> wrote on 26/09/2006 00:11:12:
> If you were to score repeated terms then I suspect it would have to be
> done so that the repetitions didn't score as highly as the first
> occurrence - otherwise f2 could be selected as a better fragment than f3
> for the query q1 in your example.
> Repetitions of a term in a fragment could be scored as a very small
> fraction of the score given to the first occurrence. This would at least
> rank  f2 higher than f1 for query q2.
> Another potentially useful ranking factor may be to boost fragments
> found at the beginning of a document - that's where people tend to write
> summaries or introductions.

Yes, it makes sense to add these heuristics.

I was somewhat surprised to find that highlighting scoring simply counts
how many unique query terms appear in the fragment. I guess I was expecting a
more similarity-like ranking of fragments - something that would perhaps
have tf related to the frequency of a term in a fragment, and idf related
to the frequency of the term in the entire text. Idf would be meaningless
for a single term query. Possibly, idf could relate to "iff" ~ inverse
number of fragments containing the term. I am not sure if this is worth the
effort, but it seems more correct...?

Another thing I saw is that Highlighter seems to break the text arbitrarily
by max-fragment-size, so for text:
  1 2 x 4 a b x d y B C D
if it happens to be broken into 4-token fragments, for query "x y" the result
would be:
  1 2 x 4 - score 1
  a b x d - score 1
  y B C D - score 1
and the first fragment would be selected 'best', although the fragment "x d
y B" that appears in that text is better. Again, not sure if this is worth
the effort - allowing overlap between candidate fragments - just
something to think about.
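
To make the overlap idea concrete, here is a minimal sketch, assuming plain whitespace tokens and the unique-term count as the score. It is not the Highlighter or Fragmenter API; SlidingFragments, uniqueHits and bestStart are hypothetical names. A step equal to the fragment size reproduces the fixed split above, while a step of 1 lets candidate fragments overlap.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration; not part of the Lucene highlighter. */
class SlidingFragments {

    /** Counts the distinct query terms found in tokens[start, start + size). */
    static int uniqueHits(String[] tokens, int start, int size, Set<String> query) {
        Set<String> seen = new HashSet<String>();
        for (int i = start; i < start + size && i < tokens.length; i++) {
            if (query.contains(tokens[i])) {
                seen.add(tokens[i]);
            }
        }
        return seen.size();
    }

    /**
     * Returns the start offset of the best-scoring window.
     * step == size gives non-overlapping fragments; step == 1 slides token by token.
     * Ties keep the earliest window.
     */
    static int bestStart(String[] tokens, int size, int step, Set<String> query) {
        int best = 0;
        int bestScore = -1;
        for (int start = 0; start < tokens.length; start += step) {
            int score = uniqueHits(tokens, start, size, query);
            if (score > bestScore) {
                bestScore = score;
                best = start;
            }
        }
        return best;
    }
}
```

On the text above with query "x y", the fixed split (step 4) picks offset 0 with score 1, and the sliding variant (step 1) picks offset 5, "b x d y", with score 2. The fragment "x d y B" at offset 6 also scores 2 and loses only on the tie-break. The cost is scoring roughly size times more candidates.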

>
>
> Doron Cohen wrote:
> > This question was raised in the user's list -
> > http://www.nabble.com/highlighting-tf2322109.html
> >
> > Assume three fragments and two queries:
> >   f1 = aa  11  bb  33  cc
> >   f2 = aa  11  bb  11  cc
> >   f3 = aa  11  bb  22  cc
> >   q1 = 11 22
> >   q2 = 11
> > Now we call highlighter.getBestFragment(q);
> > For q1, f3 is returned, as expected.
> > For q2, f1 is returned, although "11" appears twice in f2 but only once in
> > f1.
> >
> > This is because QueryScorer.getTokenScore(Token) counts only unique
> > fragment tokens.
> >
> > Would it make sense to make this behavior controllable?
> > (It is easily done but I am not sure about the consequences.)
> >
> > Or perhaps there is a way to achieve this behavior (preferring f2 over f1
> > for q2 above) that I missed?
> >



[jira] Updated: (LUCENE-676) Promote solr's PrefixFilter into Java Lucene's core

2006-09-26 Thread Andi Vajda (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-676?page=all ]

Andi Vajda updated LUCENE-676:
--

Attachment: TestPrefixFilter.java

Here is another attachment by Yura providing the requested unit test.

> Promote solr's PrefixFilter into Java Lucene's core
> ---
>
> Key: LUCENE-676
> URL: http://issues.apache.org/jira/browse/LUCENE-676
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Andi Vajda
>Priority: Trivial
> Attachments: PrefixFilter.java, TestPrefixFilter.java
>
>
> Solr's PrefixFilter class is not specific to Solr and seems to be of interest 
> to core lucene users (PyLucene in this case).
> Promoting it into the Lucene core would be helpful.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM

2006-09-26 Thread Johan Stuyts (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-636?page=comments#action_12437789 ] 

Johan Stuyts commented on LUCENE-636:
-

I just found out that the patch is incomplete. You can only specify the 
subclass of the SegmentReader class, but not the subclass of the MultiReader 
class. If your index contains multiple segments, a MultiReader is created 
instead of the specified subclass of SegmentReader, and it is not possible to 
cast the returned IndexReader to the subclass of SegmentReader you specified 
in the LuceneConfig object.


> [PATCH] Differently configured Lucene 'instances' in same JVM
> -
>
> Key: LUCENE-636
> URL: http://issues.apache.org/jira/browse/LUCENE-636
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Johan Stuyts
> Attachments: Lucene2DifferentConfigurations.patch
>
>
> Currently Lucene can be configured using system properties. When running 
> multiple 'instances' of Lucene for different purposes in the same JVM, it is 
> not possible to use different settings for each 'instance'.
> I made changes to some Lucene classes so you can pass a configuration to that 
> class. The Lucene 'instance' will use the settings from that configuration. 
> The changes do not affect the API and/or the current behavior so are 
> backwards compatible.
> In addition to the changes above I also made the SegmentReader and 
> SegmentTermDocs extensible outside of their package. I would appreciate the 
> inclusion of these changes but don't mind creating a separate issue for them.
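
The general shape of a per-instance configuration that stays backwards compatible with system properties might look like the sketch below. This is only an illustration of the idea in the description above, not the LuceneConfig class from the patch: InstanceConfig, its methods, and the property key used in the usage note are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-instance settings holder; not the class from the patch. */
class InstanceConfig {
    private final Map<String, String> overrides = new HashMap<String, String>();

    /** Sets a value that applies only to components given this config object. */
    InstanceConfig set(String key, String value) {
        overrides.put(key, value);
        return this;
    }

    /**
     * Looks up a setting: the per-instance value wins, then the JVM-wide
     * system property, then the supplied default. Code that never sets an
     * override therefore behaves exactly as it did with system properties.
     */
    String get(String key, String defaultValue) {
        String value = overrides.get(key);
        if (value != null) {
            return value;
        }
        return System.getProperty(key, defaultValue);
    }
}
```

Two 'instances' in one JVM would each hold their own InstanceConfig, for example one created with new InstanceConfig().set("example.setting", "50") and one left empty, so the second keeps reading the system property.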




[jira] Commented: (LUCENE-676) Promote solr's PrefixFilter into Java Lucene's core

2006-09-26 Thread Hoss Man (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-676?page=comments#action_12437757 ] 

Hoss Man commented on LUCENE-676:
-

Even though i use PrefixFilter on a daily basis in Solr, and i am confident of 
its correctness, I don't think anything should be committed/promoted to the 
Lucene code base without some Unit Tests.

(PrefixFilter is exercised by a few tests in the Solr code base at the moment 
but they aren't portable because they go through the SolrCore)

> Promote solr's PrefixFilter into Java Lucene's core
> ---
>
> Key: LUCENE-676
> URL: http://issues.apache.org/jira/browse/LUCENE-676
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Andi Vajda
>Priority: Trivial
> Attachments: PrefixFilter.java
>
>
> Solr's PrefixFilter class is not specific to Solr and seems to be of interest 
> to core lucene users (PyLucene in this case).
> Promoting it into the Lucene core would be helpful.




Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread markharw00d
If you were to score repeated terms then I suspect it would have to be 
done so that the repetitions didn't score as highly as the first 
occurrence - otherwise f2 could be selected as a better fragment than f3 
for the query q1 in your example.
Repetitions of a term in a fragment could be scored as a very small 
fraction of the score given to the first occurrence. This would at least 
rank  f2 higher than f1 for query q2.
Another potentially useful ranking factor may be to boost fragments 
found at the beginning of a document - that's where people tend to write 
summaries or introductions.
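
As a rough sketch of the damped-repetition idea, assuming each first occurrence of a query term scores 1.0 and each repeat scores a small fraction of that (the class name, method name, and the 0.1 fraction used in the note are hypothetical, and this is not QueryScorer):

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of damped repeat scoring; not the Lucene QueryScorer. */
class DampedFragmentScorer {

    /**
     * Scores a fragment: the first occurrence of each query term adds 1.0,
     * every further occurrence adds only repeatFraction.
     * A repeatFraction of 0.0 gives the plain unique-term count.
     */
    static double score(String[] fragment, Set<String> query, double repeatFraction) {
        Set<String> seen = new HashSet<String>();
        double score = 0.0;
        for (String token : fragment) {
            if (!query.contains(token)) {
                continue;
            }
            // Set.add returns true only the first time a term is seen
            score += seen.add(token) ? 1.0 : repeatFraction;
        }
        return score;
    }
}
```

With the f1/f2/f3 example quoted below and a fraction of 0.1, f3 still wins for q1 (2.0 against about 1.1 for f2), while for q2 f2 now edges out f1 (about 1.1 against 1.0). Any fraction below 1.0 keeps a fragment with two distinct terms ahead of one that repeats a single term once.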



Doron Cohen wrote:

This question was raised in the user's list -
http://www.nabble.com/highlighting-tf2322109.html

Assume three fragments and two queries:
  f1 = aa  11  bb  33  cc
  f2 = aa  11  bb  11  cc
  f3 = aa  11  bb  22  cc
  q1 = 11 22
  q2 = 11
Now we call highlighter.getBestFragment(q);
For q1, f3 is returned, as expected.
For q2, f1 is returned, although "11" appears twice in f2 but only once in
f1.

This is because QueryScorer.getTokenScore(Token) counts only unique
fragment tokens.

Would it make sense to make this behavior controllable?
(It is easily done but I am not sure about the consequences.)

Or perhaps there is a way to achieve this behavior (preferring f2 over f1 for
q2 above) that I missed?


