[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1340:
---

Attachment: LUCENE-1340.patch

Thanks eks, that was fast -- I think you set a new record!

The patch looks good, though we definitely need some solid unit tests
here.  I made some small (whitespace, spelling, naming) corrections and
attached a new rev of the patch.

One question I have: right now if a single field has mixed true/false
for omitTf, you set it to false, meaning we start storing the term
freq, positions and payloads again.  Can/should we do the reverse instead?
If we did, we could make some further optimizations; e.g. right now we
consume RAM storing all positions/payloads for a field that has
omitTf=true, on the chance that we may still see omitTf=false in the
same session.

With this patch we still store the *.prx bytes for a field with
omitTf=true.  Can you fix that?  I think in FreqProxTermsWriter you
can simply not write any bytes to the proxOut; likewise in
SegmentMerger and SegmentTermPositions, don't try to read bytes from
the prx file if omitTf==true.

I'd also be curious about what gains in index size & filter
performance we see with these new boolean fields.
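
As a rough illustration of where an index-size gain could come from, here is a small self-contained sketch. It is not code from the patch. It assumes the classic .frq layout, where each posting stores the doc delta shifted left by one bit, the low bit flags freq == 1, and any other freq follows as its own VInt; with omitTf the doc delta is written alone. The class and method names are made up, and the .prx (positions/payloads) savings are not modeled at all.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; not code from the LUCENE-1340 patch. */
class PostingsSizeSketch {

  /** Writes a non-negative int as a VInt: 7 bits per byte, high bit means "more bytes follow". */
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  /** Encodes (doc, freq) postings; docs must be strictly increasing. */
  static byte[] encode(int[] docs, int[] freqs, boolean omitTf) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int lastDoc = 0;
    for (int n = 0; n < docs.length; n++) {
      int delta = docs[n] - lastDoc;
      lastDoc = docs[n];
      if (omitTf) {
        writeVInt(out, delta);              // doc delta only, no freq at all
      } else if (freqs[n] == 1) {
        writeVInt(out, (delta << 1) | 1);   // low bit set: freq is 1
      } else {
        writeVInt(out, delta << 1);         // low bit clear: freq follows as a VInt
        writeVInt(out, freqs[n]);
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    int[] docs = {3, 70, 200, 1000};
    int[] freqs = {1, 4, 1, 2};
    System.out.println("with tf: " + encode(docs, freqs, false).length + " bytes");
    System.out.println("omit tf: " + encode(docs, freqs, true).length + " bytes");
  }
}
```

The saving in this toy comes from two places: the freq VInt disappears, and the doc delta no longer loses one bit to the freq flag, so more deltas fit in a single byte.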


 Make it posible not to include TF information in index
 --

 Key: LUCENE-1340
 URL: https://issues.apache.org/jira/browse/LUCENE-1340
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1340.patch, LUCENE-1340.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Term Frequency is typically not needed for all fields; some CPU (reading one 
 VInt less and one X1...) and IO can be spared by making pure boolean fields 
 possible in Lucene. This topic has already been discussed and accepted as 
 part of Flexible Indexing... This issue tries to push things forward a bit 
 faster, as I have some concrete customer demands.
 Benefits can be expected for fields that are typical candidates for Filters: 
 enumerations, user rights, IDs or very short texts, phone numbers, zip 
 codes, names...
 Status: just passed the standard tests (compatibility), committed for early 
 review. I have not tried the new feature; some asserts and one or two unit 
 tests are missing.
 Complexity: simpler than expected.
 Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1337) [PATCH] improve searching under high concurrancy

2008-07-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614971#action_12614971
 ] 

Michael McCandless commented on LUCENE-1337:


bq. Yonik checked in a modification of FSDirectory into LUCENE-753. I took that 
code and made NIOFSDirectory which is standalone so that it can be committed. 
It is checked into LUCENE-753 as lucene-753.patch. 

OK.  I think (?) it's a good idea to separately offer an FSDirectory 
implementation that uses positional reads (via FileChannel) to avoid 
synchronization.
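
For readers unfamiliar with positional reads, a minimal sketch follows. It is not the NIOFSDirectory code from the patch; the class and helper names are invented, and it uses java.nio.file for brevity, which is newer than the JREs discussed in this thread. The one real API it relies on is FileChannel.read(ByteBuffer, long), which reads at an explicit offset and leaves the channel's own position untouched, so several threads can share one channel without a synchronized seek + read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; not the NIOFSDirectory code from the patch. */
class PositionalReadSketch {

  /** Reads len bytes starting at pos without touching the channel's own position. */
  static byte[] readAt(FileChannel channel, long pos, int len) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(len);
    while (buf.hasRemaining()) {
      int n = channel.read(buf, pos + buf.position());
      if (n < 0) {
        throw new IOException("read past EOF at " + (pos + buf.position()));
      }
    }
    return buf.array();
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("pread", ".bin");
    byte[] data = new byte[1 << 16];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    Files.write(file, data);

    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      Thread[] threads = new Thread[4];
      for (int t = 0; t < threads.length; t++) {
        final int offset = t * 1000;
        threads[t] = new Thread(() -> {
          try {
            byte[] b = readAt(channel, offset, 16); // shared channel, no lock
            System.out.println("offset " + offset + " first byte " + b[0]);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        });
        threads[t].start();
      }
      for (Thread th : threads) {
        th.join();
      }
    } finally {
      Files.delete(file);
    }
  }
}
```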

I'd also like to somehow make that implementation the default on those 
platforms (all except Windows?) where there are clear concurrency gains.  I.e., 
maybe change FSDirectory.getDirectory to return NIOFSDirectory if it's not on 
Windows, but also offer a getDirectory that takes the IMPL so you can force it 
to pick a different IMPL.  In general I think Lucene should default to good 
out-of-the-box performance, i.e., without requiring special knowledge/tuning on 
the user's part, so long as there's no difficult tradeoff.
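
One way to picture the "platform default plus explicit override" idea is the sketch below. Everything in it is hypothetical: the enum, the method, and the "everything except Windows" rule are stand-ins for whatever FSDirectory.getDirectory would actually do, and the Windows exclusion is itself still an open question in this comment.

```java
import java.util.Locale;

/** Hypothetical illustration only; these names are not Lucene APIs. */
class DirectoryImplChooser {

  enum Impl { SIMPLE, NIO }

  /** Returns the explicitly requested implementation, else a platform default. */
  static Impl choose(String osName, Impl forced) {
    if (forced != null) {
      return forced; // the caller overrides the default
    }
    // Assumption: positional reads are the default everywhere except Windows.
    boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
    return windows ? Impl.SIMPLE : Impl.NIO;
  }

  public static void main(String[] args) {
    System.out.println(choose(System.getProperty("os.name"), null));
  }
}
```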

We should probably also change the name to something less generic than nio, 
though I can't think of an alternative offhand.

But one question: it looks like NIOFSIndexInput copies most of 
BufferedIndexInput source rather than subclassing -- why was that?  Can we 
change that back to a subclass, perhaps opening up members of 
BufferedIndexInput a bit if necessary?
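
To show what "subclass rather than copy" could look like, here is a stripped-down sketch. The base class below is a hypothetical stand-in, not Lucene's BufferedIndexInput, and its hook signature is simplified for the example: the buffering logic lives once in the base class, and the subclass supplies only the raw read, here a positional FileChannel read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for a buffered input base class; not Lucene's BufferedIndexInput. */
abstract class BufferedInputSketch {
  private final byte[] buffer = new byte[1024];
  private long bufferStart = 0;   // file offset of buffer[0]
  private int bufferLength = 0;   // number of valid bytes in buffer
  private int bufferPosition = 0; // next byte to hand out

  /** Subclasses fill b[offset..offset+len) with the bytes starting at file offset pos. */
  protected abstract void readInternal(long pos, byte[] b, int offset, int len) throws IOException;

  /** Total number of bytes available. */
  protected abstract long length() throws IOException;

  public final byte readByte() throws IOException {
    if (bufferPosition >= bufferLength) {
      refill();
    }
    return buffer[bufferPosition++];
  }

  private void refill() throws IOException {
    long start = bufferStart + bufferPosition;
    long remaining = length() - start;
    if (remaining <= 0) {
      throw new IOException("read past EOF");
    }
    int len = (int) Math.min(buffer.length, remaining);
    readInternal(start, buffer, 0, len);
    bufferStart = start;
    bufferLength = len;
    bufferPosition = 0;
  }
}

/** Only the raw read differs; all buffering logic is inherited rather than copied. */
class ChannelInputSketch extends BufferedInputSketch {
  private final FileChannel channel;

  ChannelInputSketch(FileChannel channel) {
    this.channel = channel;
  }

  @Override
  protected void readInternal(long pos, byte[] b, int offset, int len) throws IOException {
    ByteBuffer dst = ByteBuffer.wrap(b, offset, len);
    while (dst.hasRemaining()) {
      // dst.position() - offset is the number of bytes read so far
      int n = channel.read(dst, pos + (dst.position() - offset));
      if (n < 0) {
        throw new IOException("read past EOF");
      }
    }
  }

  @Override
  protected long length() throws IOException {
    return channel.size();
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("input", ".bin");
    Files.write(file, new byte[] {42, 43, 44});
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      ChannelInputSketch in = new ChannelInputSketch(ch);
      System.out.println("first byte: " + in.readByte());
    } finally {
      Files.delete(file);
    }
  }
}
```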

 [PATCH] improve searching under high concurrancy
 

 Key: LUCENE-1337
 URL: https://issues.apache.org/jira/browse/LUCENE-1337
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
 Environment: Linux
Reporter: Brian Gardner
Priority: Minor
 Attachments: lucene.patch


 I was trying to load test my web server and kept running into a condition 
 where the web server would become unresponsive even though the load was below 
 one.  It turns out Lucene has synchronization blocks around reading the index.  
 It appears this was only necessary to synchronize access to a descriptor 
 which contains a RandomAccessFile and information about the state of this 
 file.  My solution was to use a pool of descriptors so that they could be 
 reused on subsequent reads.  During periods of low contention only one or a 
 few Descriptors will be created, but under heavy loads many Descriptors can 
 be created to avoid synchronization.  After creating and applying my patch, I 
 was able to triple my searching throughput and fully utilize the resources, 
 with the CPUs becoming the new bottleneck.  My patch modifies FSDirectory 
 directly, but I'm not entirely sure that's the proper implementation.  I'd 
 like to help resolve this synchronization issue for other Lucene users, so 
 please let me know how I can help.
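
The descriptor-pool idea described above can be sketched roughly as follows. This is not the code in lucene.patch; the class and method names are invented for illustration. Each read borrows its own RandomAccessFile, so seek + read needs no lock, and descriptors are only created when none is idle.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not the code in lucene.patch. */
class DescriptorPoolSketch {
  private final File file;
  private final ConcurrentLinkedQueue<RandomAccessFile> idle = new ConcurrentLinkedQueue<>();
  private final AtomicInteger created = new AtomicInteger();

  DescriptorPoolSketch(File file) {
    this.file = file;
  }

  /** Reuses an idle descriptor if one exists, otherwise opens a new one. */
  RandomAccessFile acquire() throws IOException {
    RandomAccessFile raf = idle.poll();
    if (raf == null) {
      raf = new RandomAccessFile(file, "r");
      created.incrementAndGet();
    }
    return raf;
  }

  /** Returns a descriptor so a later read can reuse it. */
  void release(RandomAccessFile raf) {
    idle.add(raf);
  }

  /** The caller owns its descriptor while reading, so seek + read needs no lock. */
  int read(long pos, byte[] b, int offset, int len) throws IOException {
    RandomAccessFile raf = acquire();
    try {
      raf.seek(pos);
      return raf.read(b, offset, len);
    } finally {
      release(raf);
    }
  }

  int descriptorsCreated() {
    return created.get();
  }

  /** Closes idle descriptors; call only once no reads are in flight. */
  void close() throws IOException {
    RandomAccessFile raf;
    while ((raf = idle.poll()) != null) {
      raf.close();
    }
  }
}
```

The trade-off is more open file handles under load in exchange for lock-free reads; under low contention the pool stays at one or two descriptors.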




[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2008-07-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614974#action_12614974
 ] 

Michael McCandless commented on LUCENE-753:
---

I created a large index (indexed Wikipedia 4 times over, with stored
fields & tv offsets/positions = 72 GB).  I then randomly sampled 50
terms > 1 million freq, plus 200 terms > 100,000 freq plus 100 terms >
10,000 freq plus 100 terms > 1000 freq.  Then I warmed the OS so these
queries are fully cached in the IO cache.

It's a highly synthetic test.  I'd really love to test on real
queries, instead of single term queries.

Then I ran this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

query.maker = org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
file.query.maker.file = /lucene/wikiQueries.txt

directory=FSDirectory
pool=true

work.dir=/lucene/bigwork

OpenReader

{ "Warmup" SearchTrav(20) > : 5

{ "Rounds"
  [{ "Search" Search > : 500]: 16
  NewRound
}: 2

CloseReader 

RepSumByPrefRound Search
{code}

I ran with 2, 4, 8 and 16 threads, on an Intel quad-core Mac Pro (2 CPUs,
each dual core) running OS X 10.5.4, with 6 GB RAM, Sun JRE 1.6.0_05 and a
single WD Velociraptor hard drive.  To keep the number of searches
constant I changed the 500 count above to match (i.e. with 8 threads I
changed 500 -> 1000, with 4 threads I changed it to 2000, etc.).

Here're the results -- each run is best of 2, and all searches are
fully cached in OS's IO cache:

||Number of Threads||Patch rec/s||Trunk rec/s||Pctg gain||
|2|78.7|74.9|5.1%|
|4|74.1|68.2|8.7%|
|8|37.7|32.7|15.3%|
|16|19.2|16.3|17.8%|
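
The "Pctg gain" column is consistent with (patch - trunk) / trunk, rounded to one decimal place. The short sketch below recomputes it from the rec/s figures in the table above; the class and method names are made up for the example.

```java
import java.util.Locale;

/** Recomputes the percentage gain column from the rec/s figures in the table. */
class GainSketch {

  /** Relative gain of patch over trunk, in percent. */
  static double pctGain(double patch, double trunk) {
    return (patch - trunk) / trunk * 100.0;
  }

  public static void main(String[] args) {
    int[] threads = {2, 4, 8, 16};
    double[] patch = {78.7, 74.1, 37.7, 19.2};
    double[] trunk = {74.9, 68.2, 32.7, 16.3};
    for (int i = 0; i < threads.length; i++) {
      System.out.println(String.format(Locale.ROOT, "%d threads: %.1f%%",
          threads[i], pctGain(patch[i], trunk[i])));
    }
  }
}
```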

I also ran the same alg, replacing Search task with SearchTravRet(10)
(retrieves the first 10 docs (hits) of each search), first warming so
it's all fully cached:

||Number of Threads||Patch rec/s||Trunk rec/s||Pctg gain||
|2|1589.6|1519.8|4.6%|
|4|1460.9|1395.3|4.7%|
|8|748.9|676.0|10.8%|
|16|382.7|338.4|13.1%|

So there are smallish gains, but remember these are upper bounds on
the gains because no pooling is happening.  I'll test uncached next.


 Use NIO positional read to avoid synchronization in FSIndexInput
 

 Key: LUCENE-753
 URL: https://issues.apache.org/jira/browse/LUCENE-753
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Reporter: Yonik Seeley
 Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
 FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, 
 FSDirectoryPool.patch, FSIndexInput.patch, FSIndexInput.patch, 
 lucene-753.patch, lucene-753.patch


 As suggested by Doug, we could use NIO pread to avoid synchronization on the 
 underlying file.
 This could mitigate any MT performance drop caused by reducing the number of 
 files in the index format.




[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-19 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

Thanks Mike, with just a little bit more hand-holding we are going to be there 
:)
 
I *think* I have *.prx IO excluded in case omitTf==true, please have a look, 
this part is really not an easy one (*Merger).

Also, now if a single field has mixed true/false for omitTf, I set it to true.
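
The "mixed true/false resolves to true" rule amounts to a logical OR over every occurrence of a field. A minimal sketch, with hypothetical names and no claim about where the patch actually keeps this state:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the per-field bookkeeping used by the patch. */
class FieldFlagsSketch {
  private final Map<String, Boolean> omitTf = new HashMap<>();

  /** Records one occurrence of a field; omitTf is sticky once it has been true. */
  void add(String field, boolean fieldOmitTf) {
    omitTf.merge(field, fieldOmitTf, Boolean::logicalOr);
  }

  /** Effective setting for the field; false for fields never seen. */
  boolean omitTf(String field) {
    return omitTf.getOrDefault(field, false);
  }

  public static void main(String[] args) {
    FieldFlagsSketch flags = new FieldFlagsSketch();
    flags.add("id", false);
    flags.add("id", true);
    flags.add("id", false);
    System.out.println("id omitTf: " + flags.omitTf("id"));
  }
}
```

With the flag sticky in this direction, the indexer never has to keep positions or payloads around for a field just in case a later document turns term frequencies back on.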

One unit test is already there and the basic use case works, but the test has 
to cover a bit more.



 Make it posible not to include TF information in index
 --

 Key: LUCENE-1340
 URL: https://issues.apache.org/jira/browse/LUCENE-1340
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch

   Original Estimate: 24h
  Remaining Estimate: 24h





[jira] Commented: (LUCENE-1339) Add IndexReader.acquire() and release() methods using IndexReader's ref counting

2008-07-19 Thread Andi Vajda (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615001#action_12615001
 ] 

Andi Vajda commented on LUCENE-1339:


That would work just as well!
Andi..


 Add IndexReader.acquire() and release() methods using IndexReader's ref 
 counting
 

 Key: LUCENE-1339
 URL: https://issues.apache.org/jira/browse/LUCENE-1339
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Andi Vajda
 Fix For: 2.3.2

 Attachments: lucene-1339.patch


 From: 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
 PROTECTED]
 I have a server where a bunch of threads are handling search requests. I
 have another process that updates the index used by the search server and
 that asks the search server to reopen its index reader after the updates
 have completed.
 When I reopen() the index reader, I also close the old one (if the reopen()
 yielded a new instance). This causes problems for the other threads that
 are currently in the middle of a search request.
 I'd like to propose the addition of two methods, acquire() and release() 
 (attached to this bug report), that increment/decrement the ref count that 
 IndexReader instances currently maintain for related purposes. That ref 
 count prevents the index reader from being actually closed until it reaches 
 zero.
 My server's search threads, by acquiring and releasing the index reader, 
 can be sure that the index reader they're currently using is good until 
 they're done with the current request, i.e., until they release() it.
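
The acquire()/release() proposal can be pictured with a small reference-counting sketch. This is neither IndexReader nor the attached patch; the class is a hypothetical stand-in that only shows the counting: the owner holds one reference, each search thread takes and drops its own, and the resource is really closed only when the count reaches zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not IndexReader and not the attached patch. */
class RefCountedResourceSketch {
  private final AtomicInteger refCount = new AtomicInteger(1); // the owner's reference
  private volatile boolean closed = false;

  /** Takes a reference; fails if the resource has already been fully released. */
  void acquire() {
    while (true) {
      int n = refCount.get();
      if (n <= 0) {
        throw new IllegalStateException("already closed");
      }
      if (refCount.compareAndSet(n, n + 1)) {
        return;
      }
    }
  }

  /** Drops a reference; the last one out actually closes the resource. */
  void release() {
    int n = refCount.decrementAndGet();
    if (n == 0) {
      closed = true; // real code would free files, caches, etc. here
    } else if (n < 0) {
      throw new IllegalStateException("released more often than acquired");
    }
  }

  /** The owner's close() just gives up the owner's reference. */
  void close() {
    release();
  }

  boolean isClosed() {
    return closed;
  }
}
```

A search thread would wrap each request in acquire() and a finally block calling release(), so a reopen-and-close on another thread cannot pull the reader out from under it.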




Re: IndexReader.acquire()/release() ?

2008-07-19 Thread Andi Vajda


On Fri, 18 Jul 2008, Yonik Seeley wrote:


Although I do wonder if incRef() and decRef() aren't more suitable
names.  Just make those methods public, with the caveat that one
should not call them on a closed reader.  They are expert level APIs
after all.


That would work just as well. The acquire() and release() methods were 
intended to do exactly that (and check that the index is still open).


Andi..




[jira] Created: (LUCENE-1341) BoostingNearQuery class (prototype)

2008-07-19 Thread Peter Keegan (JIRA)

BoostingNearQuery class (prototype)
---

 Key: LUCENE-1341
 URL: https://issues.apache.org/jira/browse/LUCENE-1341
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.3.1
Reporter: Peter Keegan
Priority: Minor
 Fix For: 2.3.2


This patch implements term boosting for SpanNearQuery. Refer to: 
http://www.gossamer-threads.com/lists/lucene/java-user/62779

This patch works but probably needs more work. I don't like the use of 
'instanceof', but I didn't want to touch Spans or TermSpans. Also, the payload 
code is mostly a copy of what's in BoostingTermQuery and could be 
common-sourced somewhere. Feel free to throw darts at it :)





[jira] Updated: (LUCENE-1341) BoostingNearQuery class (prototype)

2008-07-19 Thread Peter Keegan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Keegan updated LUCENE-1341:
-

Attachment: BoostingNearQuery.java
bnq.patch

 BoostingNearQuery class (prototype)
 ---

 Key: LUCENE-1341
 URL: https://issues.apache.org/jira/browse/LUCENE-1341
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.3.1
Reporter: Peter Keegan
Priority: Minor
 Fix For: 2.3.2

 Attachments: bnq.patch, BoostingNearQuery.java






[jira] Commented: (LUCENE-1341) BoostingNearQuery class (prototype)

2008-07-19 Thread Peter Keegan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615004#action_12615004
 ] 

Peter Keegan commented on LUCENE-1341:
--

Note that this patch requires Java 1.5 or later (easily modified to run on 1.4).

