[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722738#action_12722738
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

Took a look at Lucene in Action at Borders and learned that a -Dproperty passed in 
overrides what's in the build.xml.

 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721705#action_12721705
 ] 

Michael McCandless commented on LUCENE-1539:


Where are we assuming/requiring the path be relative?




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721586#action_12721586
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

I think it would be convenient to allow passing in the data files' absolute 
path, instead of assuming they're in a relative path.  




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718899#action_12718899
 ] 

Michael McCandless commented on LUCENE-1539:


Thanks Jason, getting close:

  * Can you add a contrib/benchmark/CHANGES entry?

  * The new source files need a copyright header

  * Can you remove the undeleteAll?  I don't think the
DeleteByPercentTask should do that.

  * Can you make its param a real percent, ie so DeleteByPercent(25)
deletes 25% of the remaining docs.

  * The random-pick is going to be too slow once too many docs are
deleted (I mentioned this above, too).  How to fix?





[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718907#action_12718907
 ] 

Michael McCandless commented on LUCENE-1539:


bq. When existing deletes are over 50%, we loop through termdocs instead.

OK good, except it's deleting too aggressively when > 50% deletions are already 
present (using nextBoolean()).  Can you change that to target a certain 
deletion rate?  Ie if you need to delete 20%, then do random.nextDouble() < 
0.20 to do the delete?  But then I guess put a floor on that rate so that it 
doesn't get too slow on the tail?  It won't be perfectly random when it hits 
that tail but I think that's OK.
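
One possible reading of the suggestion above, sketched in plain Java: sweep over the docs and delete each live doc with a probability equal to the fraction still needed, never letting that probability drop below a floor. The java.util.BitSet standing in for the reader's deletions, the class and method names, and the minRate parameter are all assumptions made for illustration; none of this is taken from the patch.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical sketch: a BitSet models which docIDs are deleted.
final class RateTargetedDeleteSketch {

  /**
   * Deletes up to 'target' live docs by repeated linear sweeps.
   * Each live doc is deleted with probability 'rate', where rate is the
   * fraction of live docs still needed, floored at 'minRate' so the last
   * sweeps do not crawl. Returns the number of docs newly deleted.
   */
  static int deleteAtRate(BitSet deleted, int maxDoc, int target,
                          double minRate, Random random) {
    int count = 0;
    while (count < target) {
      int live = maxDoc - deleted.cardinality();
      if (live == 0) {
        break; // nothing left to delete
      }
      double rate = Math.max(minRate, (double) (target - count) / live);
      for (int doc = 0; doc < maxDoc && count < target; doc++) {
        if (!deleted.get(doc) && random.nextDouble() < rate) {
          deleted.set(doc);
          count++;
        }
      }
    }
    return count;
  }
}
```

Because the inner loop stops as soon as the target is reached, the final sweep favors low docIDs, which matches the remark that the tail will not be perfectly random.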





[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718957#action_12718957
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

The only small thing that came to mind is if the user decides to
subsequently (in the .alg) delete a lesser percentage of docs
than what exists in the reader. Does that mean we should
undelete docs?




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718960#action_12718960
 ] 

Michael McCandless commented on LUCENE-1539:


I'd say we don't allow that now.  EG one can easily save & open a past commit 
point, with fewer deletions?

But maybe we should throw an exception if you attempt this, so you don't 
falsely think it worked.  I'll make that change.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718961#action_12718961
 ] 

Michael McCandless commented on LUCENE-1539:


Or... maybe we should just do undeleteAll in that case?  I'll take that 
approach instead.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718440#action_12718440
 ] 

Michael McCandless commented on LUCENE-1539:


Jason this patch seems close... are you gonna have time/itch to finish this 
soonish?




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718575#action_12718575
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

It would be good to get this done.  We need the deletes to be random, or maybe 
to delete only docs that aren't already deleted?  (i.e. the loop tries to 
delete at a pos; if it's already deleted, it tries the next spot, etc).




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718578#action_12718578
 ] 

Michael McCandless commented on LUCENE-1539:


Right, I think deleteDocsByPercent should 1) determine how many docs to delete 
(deletePct * reader.numDocs()), and then 2) randomly select ones to delete, 
counting how many actually were deleted, and stopping when it reaches the 
target.  To avoid this taking excessively long when too many deletions are 
requested, you should probably invert if the pctg is > 50?  Ie, choose instead 
the docs NOT to delete, and then make a linear sweep to delete any docs not 
chosen?
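
The two steps above, plus the inversion for large percentages, can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the patch's code: a java.util.BitSet stands in for the reader's deleted docs, maxDoc - cardinality stands in for reader.numDocs(), and the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical sketch: a BitSet models which docIDs are deleted.
final class DeleteByPercentSketch {

  /**
   * Deletes pct% of the currently live docs, chosen at random.
   * Returns the number of docs newly deleted.
   */
  static int deleteByPercent(BitSet deleted, int maxDoc, double pct, Random random) {
    int numDocs = maxDoc - deleted.cardinality(); // live docs
    int target = (int) Math.round(numDocs * pct / 100.0);
    if (target == 0) {
      return 0;
    }
    if (pct <= 50.0) {
      // Step 2: pick random docIDs, counting only those not yet deleted.
      int count = 0;
      while (count < target) {
        int doc = random.nextInt(maxDoc);
        if (!deleted.get(doc)) {
          deleted.set(doc);
          count++;
        }
      }
      return count;
    }
    // Inverted case: randomly choose the live docs to KEEP ...
    int keepTarget = numDocs - target;
    BitSet keep = new BitSet(maxDoc);
    int kept = 0;
    while (kept < keepTarget) {
      int doc = random.nextInt(maxDoc);
      if (!deleted.get(doc) && !keep.get(doc)) {
        keep.set(doc);
        kept++;
      }
    }
    // ... then sweep linearly and delete every live doc that was not chosen.
    int count = 0;
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!deleted.get(doc) && !keep.get(doc)) {
        deleted.set(doc);
        count++;
      }
    }
    return count;
  }
}
```

Calling it twice on the same BitSet, say with 25 and then 80, deletes 25% of the docs and then 80% of whatever is still live. The random pick in the first branch still slows down when most of the index is already deleted, which this sketch does not address.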




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702436#action_12702436
 ] 

Michael McCandless commented on LUCENE-1539:



{quote}
Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it's deleted before by adjusting the
doc nums it's deleting?
{quote}

How about randomly choosing docs to delete instead of every N?  Then
you don't need to keep track?

{quote}
 I don't think we can relax that. This (single transaction
 (writer) open at once) is a core assumption in Lucene.

True, however that doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model so
someone could change this if they wanted. Perhaps in the
flexible indexing revamp?
{quote}

We'd need to figure out how to get multiple writers to properly
cooperate.  Actually Marvin is working on something like this (for
KS/Lucy), where one lightweight writer can do adds/deletes/small
merges, and a separate heavyweight writer does large merges.





[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701768#action_12701768
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

{quote}
I think it should mean delete XXX% of the remaining
undeleted docs?
{quote}

Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it's deleted before by adjusting the
doc nums it's deleting?

{quote}
I don't think we can relax that. This (single transaction
(writer) open at once) is a core assumption in Lucene.
{quote}

True, however that doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model so
someone could change this if they wanted. Perhaps in the
flexible indexing revamp?








[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697468#action_12697468
 ] 

Michael McCandless commented on LUCENE-1539:


This patch still has some noise, eg the unused *Property additions to 
PerfRunData, the nocommit first logic in ReadTask.

On DeleteByPercentTask: should it delete a pctg of the undeleted 
(numDocs()) docs or of the total (maxDoc()) doc space?  Right now its 
implementation is dangerous, eg, if I delete 5% of the index and then 10%, that 
10% delete will do nothing, since the docs it deletes will fall onto the exact 
docs that the 5% had deleted.
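
To make the overlap concrete, here is a small Java sketch. It assumes the task walks docIDs at a fixed stride of 100/pct, which is only a guess at the patch's behavior; the BitSet model and all names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch: deleting "every Nth doc" by docID, N = 100 / pct.
final class EveryNthDeleteSketch {

  /** Marks every (100/pct)-th docID as deleted; returns how many were newly deleted. */
  static int deleteEveryNth(BitSet deleted, int maxDoc, int pct) {
    int step = 100 / pct; // assumes 1 <= pct <= 100
    int newlyDeleted = 0;
    for (int doc = 0; doc < maxDoc; doc += step) {
      if (!deleted.get(doc)) {
        deleted.set(doc);
        newlyDeleted++;
      }
    }
    return newlyDeleted;
  }
}
```

Under this assumption, on 1000 docs a 5% pass deletes 50 docs, a following 10% pass adds only 50 more (half of its picks were already deleted), and repeating the 5% pass deletes nothing at all. The exact overlap in the real task depends on how the patch steps through docIDs.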

{quote}
It seems a bit awkward that DeleteByPercentTask needs to call
IR.undeleteAll before executing the deletes.
{quote}

Oh, I see.  I don't think it should do that?  I think it should mean delete 
XXX% of the remaining undeleted docs?

{quote}
Also that
subsequent delete by percent calls in deletepercent.alg need to
open the latest version of the index rather than the original
(which does not have deletes)
{quote}

This seems correct?  Ie the purpose of this task is to open the latest commit on 
the index and delete XXX% of its undeleted docs.

{quote}
This is due to
DirectoryIndexReader.acquireWriteLock checking to ensure the
latest version of the index is locked. Perhaps we can relax
this? I would rather be able to open a commit point and delete
from the reader, then flush as the latest version.
{quote}
I don't think we can relax that.  This (single transaction (writer) open at 
once) is a core assumption in Lucene.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696971#action_12696971
 ] 

Michael McCandless commented on LUCENE-1539:


I think DeleteByPercentTask.java is missing?

Also: I think you're missing the ability to set the deletion policy for the 
reader or writer?  Without that, only the last commit is retained.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697057#action_12697057
 ] 

Shai Erera commented on LUCENE-1539:


Is it also interesting to add extensions to EnwikiDocMaker, WriteLineDoc and 
LineDocMaker which can read/write the content in a bzip format?
I downloaded the latest Enwiki dump, 4.5 GB in bzip format. Extracted XML size 
is 17GB. I thought to myself that I don't have a real reason to extract it - I 
can read the content directly from the bzip-type file.

So I looked around and found out that in ant.jar there are two classes which 
can read/write that format. Just to compare, I gzipped the XML file and the 
result was a 5.1GB file (~13% larger). The general measurements on the web also 
show bzip is superior to gzip, although it probably runs a bit slower.

I then ran the WriteLineDoc task, to produce the one-line-per-document text 
file, and stopped when it reached 228MB. Again, I zipped, gzipped and bzipped 
the file, and the bzip format was smaller by ~20%.

So I was wondering - besides the speed of reading from a compressed archive, 
which is slower than reading from a plain XML or TXT file, is there a reason 
why we don't use bzip/gzip when reading content? It will save a lot of space 
and I'm not sure that part of the indexing is what's most important.
However, I'm aware that some people might find it better to read from plain 
files, so I suggest we just have extensions which can read/write the compressed 
format.
The question is, assuming you agree to it, should we use bzip (which requires 
an external library) or gzip, which is in the JDK and does not compress as well 
as bzip but might have better performance? (I can give it some measurements if 
needed, but the main question I have is whether we want to introduce a 
dependency on another library.)

If this belongs in a separate issue, let me know.
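
For the gzip option, a minimal sketch using only JDK classes shows how a one-line-per-document file could be written and read back through a compressed stream. The class and method names are hypothetical, the sample lines are made up, and the code uses modern Java syntax rather than what contrib/benchmark targeted at the time. A bzip2 stream from an external library would slot into the same place as the GZIP streams.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: line-per-document content stored gzip-compressed.
final class GzipLineDocSketch {

  /** Writes one document per line into a gzip-compressed byte array. */
  static byte[] writeLines(List<String> lines) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (Writer w = new OutputStreamWriter(
        new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
      for (String line : lines) {
        w.write(line);
        w.write('\n');
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads the documents back, decompressing on the fly. */
  static List<String> readLines(byte[] gz) {
    List<String> out = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(gz)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.add(line);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

In a real doc maker the byte arrays would be replaced by file streams; the reader side only changes in which InputStream it wraps.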




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697088#action_12697088
 ] 

Michael McCandless commented on LUCENE-1539:


Enabling bzip compression sounds like a win; the added dependency to 
contrib/benchmark seems fine (it already has several external dependencies).

Can you open a new issue?





[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697117#action_12697117
 ] 

Shai Erera commented on LUCENE-1539:


bq. Can you open a new issue?

Will do.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695462#action_12695462
 ] 

Michael McCandless commented on LUCENE-1539:


This patch looks good -- some questions:

  * Is CreateWikiIndex intended to be committed?  I thought not?  Ie I
thought the goal w/ this issue is to add the necessary tasks so that
CreateWikiIndex would be done as an alg.

  * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex
that needs it anyway (in only 2 places).

  * PrintReaderTask never closes the reader.

  * Not sure why you needed to relax private -> protected in AddDocTask?


 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-03-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680276#action_12680276
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

For performing simultaneous indexing and searching, how should we
best represent this in the .alg file? We have an example
indexing-multithreaded.alg so I suppose we can simply spawn another
set of threads after the [{ MAddDocs AddDoc } : 5000] : 4 line
that performs searches? Just gathering opinions as I don't feel
completely familiar with the benchmark suite yet.
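
To make the question concrete, here is a rough plain-Java analogue of what "another set of threads that performs searches" would amount to. It is only a sketch under stated assumptions: the class name and parameters are hypothetical, and an AtomicInteger stands in for the index, so nothing here is benchmark or Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: indexing and searching threads sharing one "index". */
final class ConcurrentIndexSearchSketch {

    /** Returns {documents added, searches run}. */
    static int[] run(int indexThreads, int docsPerThread,
                     int searchThreads, int searchesPerThread) {
        AtomicInteger docCount = new AtomicInteger();    // stand-in for the index
        AtomicInteger searchCount = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(indexThreads + searchThreads);
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            // Indexing threads, analogous to the parallel AddDoc line in the alg.
            for (int i = 0; i < indexThreads; i++) {
                pending.add(CompletableFuture.runAsync(() -> {
                    for (int d = 0; d < docsPerThread; d++) {
                        docCount.incrementAndGet(); // stand-in for adding one document
                    }
                }, pool));
            }
            // Search threads started alongside them, not after they finish.
            for (int i = 0; i < searchThreads; i++) {
                pending.add(CompletableFuture.runAsync(() -> {
                    for (int s = 0; s < searchesPerThread; s++) {
                        docCount.get(); // stand-in for a query against the current index
                        searchCount.incrementAndGet();
                    }
                }, pool));
            }
            // Wait for both groups to complete.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return new int[] { docCount.get(), searchCount.get() };
    }
}
```

Calling run(4, 5000, 2, 100) mirrors four indexing threads of 5000 documents each plus two search threads. In an .alg file the two groups would need to sit inside one parallel block to overlap in the same way.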

 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-02-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675336#action_12675336
 ] 

Michael McCandless commented on LUCENE-1539:


bq. In looking over the code, to do the multiple commits using IR we'll need to 
add an IR.flush(String userData) method?

Yes, we should.  Can you open a new issue + patch?

We also have to fix contrib/benchmark to allow specification of a Deletion 
Policy, and then allow openReader task to take a string (userData) to specify 
which commit to open.

But: it'd be best if, within a single alg, we could specify a series of commits 
to open, so that we can iterate over the different commit points.  I don't 
think a param to the task allows this?  (But I'm not sure).  If we made it a 
config option then I believe we could specify a sequence which each round would 
advance through.
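
A minimal sketch of the "sequence which each round would advance through" idea is below. The RoundSequence class, the colon-separated format and the commit labels are assumptions made for illustration; this is not the benchmark Config class.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: one value per round, wrapping when rounds outnumber values. */
final class RoundSequence {
    private final List<String> values;

    private RoundSequence(List<String> values) {
        this.values = values;
    }

    /** Parses a colon-separated list such as "commitA:commitB:commitC". */
    static RoundSequence parse(String spec) {
        if (spec == null || spec.isEmpty()) {
            throw new IllegalArgumentException("empty spec");
        }
        String[] parts = spec.split(":");
        if (parts.length == 0) {
            // A spec made only of separators yields no values.
            throw new IllegalArgumentException("no values in spec: " + spec);
        }
        return new RoundSequence(Collections.unmodifiableList(Arrays.asList(parts)));
    }

    /** Value to use in the given zero-based round. */
    String valueForRound(int round) {
        if (round < 0) {
            throw new IllegalArgumentException("negative round: " + round);
        }
        return values.get(round % values.size());
    }
}
```

An open-reader task could then ask for valueForRound(currentRound) and open the commit whose userData matches, so a single alg iterates over the commit points without any per-task parameter.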

 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-02-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675196#action_12675196
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

In looking over the code, to do the multiple commits using IR we'll need to add 
an IR.flush(String userData) method?

 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.
