[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722738#action_12722738 ] Jason Rutherglen commented on LUCENE-1539: -- Took a look at Lucene in Action at Borders and learned the -Dproperty passed in overrides what's in the build.xml. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721705#action_12721705 ] Michael McCandless commented on LUCENE-1539: Where are we assuming/requiring the path be relative? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721586#action_12721586 ] Jason Rutherglen commented on LUCENE-1539: -- I think it would be convenient to allow passing in the data files' absolute path, instead of assuming they're in a relative path. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718899#action_12718899 ] Michael McCandless commented on LUCENE-1539: Thanks Jason, getting close: * Can you add contrib/benchmark/CHANGES entry? * The new source files need a copyright header * Can you remove the undeleteAll? I don't think the DeleteByPercentTask should do that. * Can you make its param a real percent, ie so DeleteByPercent(25) deletes 25% of the remaining docs. * The random-pick is going to be too slow once too many docs are deleted (I mentioned this above, too). How to fix? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718907#action_12718907 ] Michael McCandless commented on LUCENE-1539: bq. When existing deletes are over 50%, we loop through termdocs instead. OK good, except it's deleting too aggressively when 50% deletions are already present (using nextBoolean()). Can you change that to target a certain deletion rate? Ie if you need to delete 20%, then do random.nextDouble() 0.20 to do the delete? But then I guess put a floor on that rate so that it doesn't get too slow on the tail? It won't be perfectly random when it hits that tail but I think that's OK. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718957#action_12718957 ] Jason Rutherglen commented on LUCENE-1539: -- The only small thing that came to mind is if the user decides to subsequently (in the .alg) delete a lesser percentage of docs than the what exists in the reader. Does that mean we should undelete docs? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718960#action_12718960 ] Michael McCandless commented on LUCENE-1539: I'd say we don't allow that now. EG one can easily save open a past commit point, with less deletions? But maybe we should throw an exception if you attempt this, so you don't falsely think it worked. I'll make that change. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718961#action_12718961 ] Michael McCandless commented on LUCENE-1539: Or... maybe we should just do undeleteAll all that case? I'll take that approach instead. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718440#action_12718440 ] Michael McCandless commented on LUCENE-1539: Jason this patch seems close... are you gonna have time/itch to finish this soonish? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718575#action_12718575 ] Jason Rutherglen commented on LUCENE-1539: -- It would be good to get done, we need the deletes to randomly delete, or maybe just delete only docs that aren't already deleted? (i.e. the loop tries to delete at a pos, if it's already deleted, try the next spot, etc). Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718578#action_12718578 ] Michael McCandless commented on LUCENE-1539: Right, I think deleteDocsByPercent should 1) determine how many docs to delete (deletePct * reader.numDocs()), and then 2) random select ones to delete, counting how many actually were deleted, and stopping when it reaches the target. To avoid this taking excessively long when too many deletions are requested, you should probably invert if the %tg is 50? Ie, choose instead the docs NOT to delete, and then make a linear sweep to delete any docs not chosen? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702436#action_12702436 ] Michael McCandless commented on LUCENE-1539: {quote} Yeah? Ok. So the deleteDocsByPercent method needs to somehow take into account whether it's deleted before by adjusting the doc nums it's deleting? {quote} How about randomly choosing docs to delete instead of every N? Then you don't need to keep track? {quote} I don't think we can relax that. This (single transaction (writer) open at once) is a core assumption in Lucene. True, however doesn't mean we have to stick with it, especially internally. Hopefully we can move to a more componentized model someone could change this if they wanted. Perhaps in the flexible indexing revamp {quote} We'd need to figure out how to get multiple writers to properly cooperate. Actually Marvin is working on something like this (for KS/Lucy), where one lightweight writer can do adds/deletes/small merges, and a separate heavyweight writer does large merges. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701768#action_12701768 ] Jason Rutherglen commented on LUCENE-1539: -- {quote} I think it should mean delete XXX% of the remaining undeleted docs? {quote} Yeah? Ok. So the deleteDocsByPercent method needs to somehow take into account whether it's deleted before by adjusting the doc nums it's deleting? {quote} I don't think we can relax that. This (single transaction (writer) open at once) is a core assumption in Lucene. {quote} True, however doesn't mean we have to stick with it, especially internally. Hopefully we can move to a more componentized model someone could change this if they wanted. Perhaps in the flexible indexing revamp? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697468#action_12697468 ] Michael McCandless commented on LUCENE-1539: This patch still has some noise, eg the unused *Property additions to PerfRunData, the nocommit first logic in ReadTask. On DeleteTaskByPercentTask: should it delete a pctg of the undeleted (numDocs()) docs or of the total (maxDoc()) doc space? Right now its implementation is dangerous, eg, if I delete 5% of the index and then 10%, that 10% delete will do nothing, since the docs it deletes will fall onto the exact docs that the 5% had deleted. {quote} It seems a bit awkward that DeleteByPercentTask needs to call IR.undeleteAll before executing the deletes. {quote} Oh, I see. I don't think it should do that? I think it should mean delete XXX% of the remaining undeleted docs? {quote} Also that subsequent delete by percent calls in deletepercent.alg need to open the latest version of the index rather than the original (which does not have deletes) {quote} This seems correct? Ie the purpose of this task is open the latest commit on the index, delete XXX% of its undeleted docs. {quote} This is due to DirectoryIndexReader.acquireWriteLock checking to insure the latest version of the index is locked. Perhaps we can relax this? I would rather be able to open a commit point and delete from the reader, then flush as the latest version. {quote} I don't think we can relax that. This (single transaction (writer) open at once) is a core assumption in Lucene. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696971#action_12696971 ] Michael McCandless commented on LUCENE-1539: I think DeleteByPercentTask.java is missing? Also: I think you're missing the ability to set the deletion policy for the reader or writer? Without that, only the last commit is retained. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697057#action_12697057 ] Shai Erera commented on LUCENE-1539: Is it also interesting to add extensions to EnwikiDocMaker, WriteLineDoc and LineDocMaker which can read/write the content in a bzip format? I downloaded the latest Enwiki dump, 4.5 GB in bzip format. Extracted XML size is 17GB. I thought to myslef that I don't have a real reason to extract it - I can read the content directly from the bzip-type file. So I looked around and found out that in ant.jar there are two classes which can read/write that format. Just to compare, I gzipped the XML file and the result was 5.1GB file (~13% larger). The general measurements on the web also show bzip is superior to gzip, although it probably runs a bit slower. I then ran the WriteLineDoc task, to produce the one-line-per-document text file, and stopped when it reache 228MB. Again, I zipped, gzipped and bzipped the file, and the bzip format was smaller by ~20%. So I was wondering - besides the speed of writing from a compressed archive, which is slwoer than reading from a plain XML or TXT file, is there a reason why we don't use bzip/gzip when reading content? It will save a lot of space and I'm not sure that part of the indexing is what's most important. However, I'm aware that some people might find it better to read from plain files, so I suggest we just have extensions which can read/write the compressed format. The question is, assuming you agree to it, should we use bzip (which requires external library) or gzip which is in the JDK, does not compress as good as bzip, but might have better performance (I can give it some measurements if needed, but the main question I have is whether we want to introduce a dependency on another library). If this belongs in a separate issue, let me know. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697088#action_12697088 ] Michael McCandless commented on LUCENE-1539: Enabling bzip compression sounds like a win; the added dependency to contrib/benchmark seems fine (it already has several external dependencies). Can you open a new issue? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697117#action_12697117 ] Shai Erera commented on LUCENE-1539: bq. Can you open a new issue? Will do. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695462#action_12695462 ] Michael McCandless commented on LUCENE-1539: This patch looks good -- some questions: * Is CreateWikiIndex intended to be committed? I thought not? Ie I though the goal w/ this issue is add the necessary tasks so that CreateWikiIndex would be done as an alg. * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex that needs it anyway (in only 2 places). * PrintReaderTask never closes the reader. * Not sure why you needed to relax private - protected in AddDocTask? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680276#action_12680276 ] Jason Rutherglen commented on LUCENE-1539: -- For a performing simultaneous indexing and searching, how should we best represent this in the .alg file? We have an example indexing-multithreaded.alg so I suppose we can simply spawn another set of threads after the [{ MAddDocs AddDoc } : 5000] : 4 line that performs searches? Just gathering opinions as I don't feel completely familiar with the benchmark suite yet. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675336#action_12675336 ] Michael McCandless commented on LUCENE-1539: bq. In looking over the code, to do the multiple commits using IR we'll need to add a IR.flush(String userData) method? Yes, we should. Can you open a new issue + patch? We also have to fix contrib/benchmark to allow specification of a Deletion Policy, and then allow openReader task to take a string (userData) to specific which commit to open. But: it'd be best if, within a single alg, we could specify a series of commits to open, so that we can iterate over the different commit points. I don't think a param to the task allows this? (But I'm not sure). If we made it a config option then I believe we could specify a sequence which each round would advance through. Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675196#action_12675196 ] Jason Rutherglen commented on LUCENE-1539: -- In looking over the code, to do the multiple commits using IR we'll need to add a IR.flush(String userData) method? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org