[jira] Commented: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected

2006-08-31 Thread Steven Parkes (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-666?page=comments#action_12431921 ] 

Steven Parkes commented on LUCENE-666:
--

I think the idea of a parse exception is good. The issue is relatively subtle. 
A user can fairly easily generate a query that ignores terms. Doing that 
silently is just asking for people to spend a lot of time trying to understand 
why they aren't getting what they expect.

The question is whether it's feasible to detect that an invalid query has been 
entered. The problem, of course, is having to enumerate all documents. But 
with multilevel queries (sums of products of sums, etc.), it's not immediately 
obvious from the parse tree whether that is being attempted or not.

a(b + !c) is okay but (a + !d)(b + !c) isn't. Basically, you can't do a 
term-by-term analysis of a multilevel query and decide; it's a global property 
of the query.

If one reduces the multilevel query to a two-level sum-of-products form (i.e., 
distributes all AND operations over all OR operations), you can inspect the 
result: if any product term does not contain a positive term, the query is 
invalid. But going from multilevel to two-level can generate an exponential 
number of terms. Maybe not an issue in IR, but it makes me nervous.
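For what it's worth, that global property looks computable without the exponential expansion: a single recursive pass over the parse tree can decide whether every product term of the sum-of-products form would contain a positive term. A sketch (the tuple-based tree representation is made up for illustration, not Lucene's parse tree):

```python
# Sketch: decide whether every product term in the sum-of-products form
# of a boolean query contains at least one positive term, WITHOUT
# actually distributing ANDs over ORs. Nodes are ("term", name),
# ("not", name), ("and", children), ("or", children) -- a made-up
# representation for illustration, not Lucene's parse tree.

def all_products_have_positive(node):
    kind = node[0]
    if kind == "term":   # a positive term: its single product qualifies
        return True
    if kind == "not":    # a lone negated term is a pure-negative product
        return False
    if kind == "or":     # OR just pools the children's product terms,
        return all(all_products_have_positive(c) for c in node[1])
    if kind == "and":    # each distributed product takes one term from
        return any(all_products_have_positive(c) for c in node[1])
    raise ValueError("unknown node kind: %r" % kind)

# a(b + !c) is okay:
ok = ("and", [("term", "a"), ("or", [("term", "b"), ("not", "c")])])
# (a + !d)(b + !c) is not:
bad = ("and", [("or", [("term", "a"), ("not", "d")]),
               ("or", [("term", "b"), ("not", "c")])])
```

The reasoning: for OR, the distributed products are just the union of the children's products, so every child must qualify; for AND, each distributed product includes one product term from each child, so a single child whose products all contain a positive term is enough. That makes the check linear in the parse-tree size, though I haven't checked how it interacts with Lucene's actual BooleanQuery structures.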

I would think that somewhere between query parsing and search execution, the 
invalidity of a query would have to pop out, but I haven't delved into this 
path enough to know where that might happen.

If no one's looking at this (generating an exception somewhere) and no one 
knows of any reason it's not possible, I'll look ...

On a related topic, and maybe a naive question: has anyone considered keeping a 
posting that contains all documents? Is it really that hard to do in Lucene 
today? That would remove this restriction. Perhaps the issue then is that the 
number of valid use cases, weighed against the number of times it would cause 
people trouble because they were using it invalidly, means it's not a good 
idea? I can come up with some potentially interesting queries, especially when 
you combine it with other features like facets. "Tell me the facet breakdown 
of all documents that don't contain the word microsoft ..." In many places, 
!microsoft is not such a terribly big set and can be computed with about the 
same effort that microsoft can. (Is that true?)
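To make the cost argument concrete: with an all-documents posting, !microsoft is just a set complement over sorted doc-id lists, computable in one merge pass over the two lists. A toy sketch (the doc ids and the function are made up for illustration, nothing Lucene-specific):

```python
# Toy sketch: with a posting that lists every doc id, "NOT term" is a
# complement, computable in one merge pass over two sorted lists --
# roughly the cost of reading the term's posting plus the full list.
# Doc ids and postings here are made up for illustration.

def complement(all_docs, term_posting):
    """Sorted doc ids that are in all_docs but not in term_posting."""
    result, i = [], 0
    for doc in all_docs:
        # advance the term posting to the first entry >= doc
        while i < len(term_posting) and term_posting[i] < doc:
            i += 1
        if i == len(term_posting) or term_posting[i] != doc:
            result.append(doc)
    return result

all_docs = [1, 2, 3, 4, 5]
microsoft = [2, 4]                     # docs containing "microsoft"
not_microsoft = complement(all_docs, microsoft)   # [1, 3, 5]
```

So the effort is linear in the total number of documents, which is the "about the same effort" claim above, assuming the all-documents list is cheap to iterate.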

> TERM1 OR NOT TERM2 does not perform as expected
> ---
>
> Key: LUCENE-666
> URL: http://issues.apache.org/jira/browse/LUCENE-666
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 2.0.0
> Environment: Windows XP, JavaCC 4.0, JDK 1.5
>Reporter: Dejan Nenov
> Attachments: TestAornotB.java
>
>
> test:
> [junit] Testsuite: org.apache.lucene.search.TestAornotB
> [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.39 sec
> [junit] - Standard Output ---
> [junit] Doc1 = A B C
> [junit] Doc2 = A B C D
> [junit] Doc3 = A   C D
> [junit] Doc4 =   B C D
> [junit] Doc5 = C D
> [junit] -
> [junit] With query "A OR NOT B" we expect to hit
> [junit] all documents EXCEPT Doc4, instead we only match on Doc3.
> [junit] While LUCENE currently explicitly does not support queries of
> [junit] the type "find docs that do not contain TERM" - this explains
> [junit] not finding Doc5, but does not justify eliminating Doc1 and Doc2
> [junit] -
> [junit]  the fix should likely require a modification to QueryParser.jj
> [junit]  around the method:
> [junit]  protected void addClause(Vector clauses, int conj, int mods, 
> Query q)
> [junit] Query:c:a -c:b hits.length=1
> [junit] Query Found:Doc[0]= A C D
> [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
> [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
> [junit] 1.0 = tf(termFreq(c:a)=1)
> [junit] 1.2231436 = idf(docFreq=3)
> [junit] 0.5 = fieldNorm(field=c, doc=1)
> [junit]   0.0 = match on prohibited clause (c:b)
> [junit] 0.6115718 = (MATCH) fieldWeight(c:b in 1), product of:
> [junit]   1.0 = tf(termFreq(c:b)=1)
> [junit]   1.2231436 = idf(docFreq=3)
> [junit]   0.5 = fieldNorm(field=c, doc=1)
> [junit] 0.6115718 = (MATCH) sum of:
> [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
> [junit] 1.0 = tf(termFreq(c:a)=1)
> [junit] 1.2231436 = idf(docFreq=3)
> [junit] 0.5 = fieldNorm(field=c, doc=2)
> [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
> [junit]   0.0 = match on prohibited clause (c:b)
> [junit] 0.6

[jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-08-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-664?page=comments#action_12431954 ] 

Doug Cutting commented on LUCENE-664:
-

Javadoc is easiest to keep in sync with changes to the code, since it is in the 
same files.  The wiki is the hardest to keep in sync with the code, since it is 
not versioned with the code.  The website is somewhere in between: it is in 
subversion, but in separate files in a separate tree.

The javadoc is thus the preferred location for documentation that is specific 
to the code.  The website and wiki are better for stuff that's specific to the 
project: policies, procedures, etc.  The wiki is great for user-generated stuff 
like benchmarks, porting tricks, use cases, etc.

This stuff seems pretty closely tied to the code, so I'd put it in the javadoc. 
 It's nearly all stuff that's in the search package, so much of this could go 
in search/package.html with pointers to the javadoc for Query, Weight, Scorer, 
etc.

> [PATCH] small fixes to the new scoring.html doc
> ---
>
> Key: LUCENE-664
> URL: http://issues.apache.org/jira/browse/LUCENE-664
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Affects Versions: 2.0.1
>Reporter: Michael McCandless
> Attachments: lucene.uxf, scoring-small-fixes.patch, 
> scoring-small-fixes2.patch, scoring-small-fixes3.patch
>
>
> This is an awesome initiative.  We need more docs that cleanly explain the 
> inner workings of Lucene in general... thanks Grant & Steve & others!
> I have a few small initial proposed fixes, largely just adding some more 
> description around the components of the formula.  But also a couple typos, 
> another link out to Wikipedia, a missing closing ), etc.  I've only made it 
> through the "Understanding the Scoring Formula" section so far.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-08-31 Thread Steven Parkes
For my part, I was thinking more about material along the lines of 1.5
(Understanding the core indexing classes) and 1.6 (Understanding the
core searching classes) in LIA. It strikes me that javadocs are a little
hard-core for that. A little difficult for the beginner to absorb. More
reference than tutorial?

Assuming then that we do have material that shouldn't go into the
javadocs, where should it go? As far as I can tell, material like that
in LIA doesn't exist right now, at least not on an apache site. Maybe I
missed it? 

I thought about the versioning issue with the wiki and I failed to come
up with a solution. Versioning the wiki seems to not make sense. I can
see the reasoning behind putting tutorial material in the website, but I
would so miss WikiWords.




[jira] Created: (LUCENE-667) javacc skeleton files not regenerated

2006-08-31 Thread Steven Parkes (JIRA)
javacc skeleton files not regenerated
-

 Key: LUCENE-667
 URL: http://issues.apache.org/jira/browse/LUCENE-667
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Steven Parkes
Priority: Minor
 Attachments: javacc.patch

Copies of the character stream files for javacc are checked into svn. These 
files were generated under javacc 3.0 (at least that's what they say, though 
javacc 3.2 says this too). javacc 4 complains that they are out of date but 
won't replace them; they must be removed before it will regenerate them.

There is one side effect of removing them: local changes are lost.  r387550 
removed a couple of deprecated methods. By using the files as generated by 
javacc, these deprecated methods will be re-added (at least until the javacc 
team removes them entirely). There are other changes being made to the stream 
files, so I would think it's better to live with them unmodified than to keep 
local versions just for this change.

If we want javacc to recreate the files, the attached patch will remove them 
before running javacc.

All the tests pass using both javacc 3.2 and 4.0.







[jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-08-31 Thread Michael McCandless (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12432004 ] 

Michael McCandless commented on LUCENE-665:
---

Wow!  Fantastic sleuthing.  I never would have guessed that.

> temporary file access denied on Windows
> ---
>
> Key: LUCENE-665
> URL: http://issues.apache.org/jira/browse/LUCENE-665
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Store
>Affects Versions: 2.0.0
> Environment: Windows
>Reporter: Doron Cohen
> Attachments: FSDirectory_Retry_Logic.patch, 
> FSDirs_Retry_Logic_3.patch, Test_Output.txt, TestInterleavedAddAndRemoves.java
>
>
> When interleaving adds and removes there is frequent opening/closing of 
> readers and writers. 
> I tried to measure performance in such a scenario (for issue 565), but the 
> performance test failed  - the indexing process crashed consistently with 
> file "access denied" errors - "cannot create a lock file" in 
> "lockFile.createNewFile()" and "cannot rename file".
> This is related to:
> - issue 516 (a closed issue: "TestFSDirectory fails on Windows") - 
> http://issues.apache.org/jira/browse/LUCENE-516 
> - user list questions due to file errors:
>   - 
> http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
>   - 
> http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
> - discussion on lock-less commits 
> http://www.nabble.com/Lock-less-commits-tf2126935.html
> My test setup is: XP (SP1), JAVA 1.5 - both SUN and IBM SDKs. 
> I noticed that the problem is more frequent when locks are created on one 
> disk and the index on another. Both are NTFS with Windows indexing service 
> enabled. I suspect this indexing service might be related - keeping files 
> busy for a while, but don't know for sure.
> After experimenting with it I conclude that these problems - at least in my 
> scenario - are due to a temporary situation - the FS, or the OS, is 
> *temporarily* holding references to files or folders, preventing them from 
> being renamed or deleted, and preventing new files from being created in 
> certain directories. 
> So I added to FSDirectory a retry logic in cases the error was related to 
> "Access Denied". This is the same approach brought in 
> http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
>  - there, in addition to the retry, gc() is invoked (I did not gc()). This is 
> based on the *hope* that an access-denied situation would vanish after a small 
> delay, and the retry would succeed.
> I modified FSDirectory this way for "Access Denied" errors that occur when 
> creating a new file or renaming a file.
> This worked fine for me. The performance test that failed before now managed 
> to complete. There should be no performance implications due to this 
> modification, because only the cases that would otherwise wrongly fail now 
> delay for some extra millis and retry.
> I am attaching here a patch - FSDirectory_Retry_Logic.patch - that has these 
> changes to FSDirectory. 
> All "ant test" tests pass with this patch.
> Also attaching a test case that demonstrates the problem - at least on my 
> machine. There are two test cases in that test file - one that works in the 
> system temp directory (like most Lucene tests) and one that creates the index 
> on a different disk. The latter case can only run if the path ("D:" , "tmp") 
> is valid.
> It would be great if people that experienced these problems could try out 
> this patch and comment whether it made any difference for them. 
> If it turns out useful for others as well, including this patch in the code 
> might help to relieve some of those "frustration" user cases.
> A comment on the state of the proposed patch: 
> - It is not "ready to deploy" code - it has some debug printing, showing 
> the cases where the "retry logic" actually took place. 
> - I am not sure if the current 30ms is the right delay... why not 50ms? 10ms? 
> This is currently defined by a constant.
> - Should a call to gc() be added? (I think not.)
> - Should the retry be attempted also on "non access-denied" exceptions? (I 
> think not.)
> - I feel it is somewhat "voodoo programming", and though I don't like it, it 
> seems to work... 
> Attached files:
> 1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP without 
> the patch and passes with the patch.
> 2. FSDirectory_Retry_Logic.patch
> 3. Test_Output.txt - output of the test with the patch, on my XP. Only the 
> createNewFile() case had to be bypassed in this test, but for another program 
> I also saw the renameFile() being bypassed.
> - Doron
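The retry approach described above boils down to something like the following sketch (simplified; the delay, retry count, and error predicate here are stand-ins for illustration, not values or code from the actual patch):

```python
# Simplified sketch of the retry idea: when an operation fails with a
# transient "access denied"-style error, sleep briefly and try again a
# bounded number of times. Constants and the error predicate are
# stand-ins, not taken from the actual FSDirectory patch.
import time

RETRY_DELAY_MS = 30
MAX_RETRIES = 10

def with_retries(operation, is_transient, sleep=time.sleep):
    """Run operation(), retrying transient failures after a short delay."""
    for attempt in range(MAX_RETRIES):
        try:
            return operation()
        except OSError as e:
            if not is_transient(e) or attempt == MAX_RETRIES - 1:
                raise                       # real failure, or out of retries
            sleep(RETRY_DELAY_MS / 1000.0)  # hope the OS releases the file
```

The bound on retries matters: without it, a genuinely stuck file (rather than a transiently held one) would hang the indexing thread instead of surfacing the error.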


Re: [jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-08-31 Thread Chris Hostetter
: Javadoc is easiest to keep in sync with changes to the code, since it is
: in the same files.  The wiki is the hardest to keep in sync with the
: code, since it is not versioned with the code.  The website is somewhere
: between: it is in subversion, but in separate files in a separate tree.

the separation of the site in the tree certainly makes keeping them in
sync less easy than keeping the javadocs in sync -- but the fact that the
"site" at the moment of a release is bundled in with the release makes it
a pretty optimal place for things like this.

that said, it certainly seems like sections of the scoring.html might make
more sense in the javadocs, leaving the file to be more of a "skeleton" of
links out .. likewise this doc should be referenced by more links *from*
javadocs of various classes.

if this was something i was building for work, i'd put all of this info
in a doc-files directory so it was completely bundled with both the source
code and any build of the javadocs -- but nobody i work with really agrees
with me that doc-files directories are cool either.




-Hoss





RE: [jira] Commented: (LUCENE-664) [PATCH] small fixes to the new scoring.html doc

2006-08-31 Thread Chris Hostetter

: For my part, I was thinking more about material along the lines of 1.5
: (Understanding the core indexing classes) and 1.6 (Understanding the
: core searching classes) in LIA. It strikes me that javadocs are little
: hard-core for that. A little difficult for the beginner to absorb. More
: reference than tutorial?

perhaps the existing scoring.html document should be split up? .. leave
the Introduction, and most of the "Scoring" sections in the "site" (much
like query parser syntax and file formats are) and move some of the more
developer specific / class specific information from the "Query
Classes", "Changing Similarity" and "Changing your Scoring" sections into
the javadocs?



-Hoss





[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-31 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Doron Cohen updated LUCENE-565:
---

Attachment: perf-test-res2.JPG

Updated performance test results - perf-test-res2.JPG - on average, the new 
code is *9* times faster!

What changed? In the previous test I forgot to set max-buffered-deletes. 

After fixing that, I removed the test cases with a max-buffer of 5,000 and up, 
because they consumed too much memory, and added more practical (I think) cases 
of 2000 and 3000. 

Here is a textual summary of the data in the attached image:

max buf add/del      10     10    100   1000   2000   3000
iterations            1     10    100    100    200    300
adds/iteration       10     10     10     10     10     10
dels/iteration        5      5      5      5      5      5
orig time (sec)    0.13   0.86   9.57   8.88  22.74  44.01
new  time (sec)    0.20   0.95   1.74   1.30   2.16   3.08
Improvement (sec) -0.07  -0.09   7.83   7.58  20.58  40.94
Improvement  (%)   -55%   -11%    82%    85%    90%    93%

Note: for the first two cases the new code is slower, by 55% and 11%, but 
these are very short test cases - the absolute difference here is less than 
100ms, compared to the other cases, where the difference is measured in 
seconds and tens of seconds.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
> NewIndexWriter.July18.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change, it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance of our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
> Xeon server running Linux. The disk storage was configured as a RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during i
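The buffered-delete scheme described under "Coding Changes" amounts to something like this sketch (much simplified, using a plain dict as the "index"; it ignores the add/delete ordering subtleties the patch handles with synchronization, and is not the attached IndexWriter code):

```python
# Much-simplified sketch of the buffered-delete idea: deleteDocuments()
# only records the term; deletes are applied when the buffer is flushed,
# so interleaved small batches of adds and deletes don't force a flush
# to disk each time. The "index" is a plain dict of doc id -> term set;
# ordering between adds and deletes within a buffer is ignored here.

class SketchWriter:
    def __init__(self, max_buffered_deletes=1000):
        self.index = {}              # "on disk": doc id -> set of terms
        self.buffered_docs = {}      # added but not yet flushed
        self.buffered_deletes = []   # terms whose docs should be removed
        self.max_buffered_deletes = max_buffered_deletes
        self.next_id = 0

    def add_document(self, terms):
        self.buffered_docs[self.next_id] = set(terms)
        self.next_id += 1

    def delete_documents(self, term):
        self.buffered_deletes.append(term)       # just buffer the term
        if len(self.buffered_deletes) >= self.max_buffered_deletes:
            self.flush()                         # deletes applied only here

    def flush(self):
        self.index.update(self.buffered_docs)
        self.buffered_docs = {}
        for term in self.buffered_deletes:
            self.index = {d: t for d, t in self.index.items()
                          if term not in t}
        self.buffered_deletes = []
```

This also shows why max-buffered-deletes matters for the timings above: a small value forces frequent flushes, while a large one defers the (expensive) apply-deletes pass, at the cost of memory for the buffered terms.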