[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated LUCENE-2393:

Attachment: LUCENE-2393.patch

Updated HighFreqTermsWithTF to use the flex API. I don't understand the flex API well enough yet to determine whether I should have used DocsEnum.read/DocsEnum.getBulkResult() to do a bulk read instead of DocsEnum.nextDoc() and DocsEnum.freq().

> Utility to output total term frequency and df from a lucene index
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Tom Burton-West
> Priority: Trivial
> Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
> This is a command line utility that takes a field name, term, and index
> directory and outputs the document frequency for the term and the total
> number of occurrences of the term in the index (i.e. the sum of the tf of the
> term for each document). It is useful for estimating the size of the term's
> entry in the *prx files and consequent Disk I/O demands

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
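The two numbers the utility reports can be illustrated without any Lucene dependency. The following is only a sketch, assuming a toy postings list held in a map from document id to within-document tf; the class and method names are hypothetical and are not taken from the patch. It mirrors the one-document-at-a-time style of iteration (nextDoc()/freq()) rather than a bulk read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for one term's postings: docId -> tf within that document.
// This is NOT the Lucene API; all names here are hypothetical.
class TermStatsSketch {

    /** Document frequency: the number of documents that contain the term. */
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    /** Total term frequency: the sum of the per-document tf values. */
    static long totalTermFreq(Map<Integer, Integer> postings) {
        long total = 0L; // long, because the sum can exceed the int range on a large index
        for (int tf : postings.values()) {
            total += tf;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(0, 3); // the term occurs 3 times in doc 0
        postings.put(4, 1);
        postings.put(9, 7);
        System.out.println("df=" + docFreq(postings)
                + " totalTF=" + totalTermFreq(postings));
    }
}
```

The total is accumulated in a long on purpose: on a very large index the sum of tf values for a common term can plausibly exceed what an int can hold.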
[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated LUCENE-2393:

Attachment: LUCENE-2393.patch

New patch includes a (pre-flex) version of HighFreqTerms that finds the top N terms with the highest docFreq, looks up the total term frequency of each, and outputs the list of terms sorted by highest total term frequency (which approximates the largest entries in the *prx files).

I'm not sure how to combine the GetTermInfo program with either version of HighFreqTerms in a way that leads to sane command line arguments and argument processing. I suppose that HighFreqTerms could have a flag that turns the inclusion of total term frequency on or off.
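The two-stage selection described in this message (take the top N terms by docFreq, then order those by total term frequency) might be sketched as follows. This is an illustration only: TermStat and the method name are hypothetical, the statistics are made-up toy values, and a full sort is used for brevity where a real tool would more likely keep a bounded priority queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HighFreqSketch {

    /** Hypothetical holder for one term's statistics; not a Lucene class. */
    static final class TermStat {
        final String term;
        final int docFreq;
        final long totalTermFreq;

        TermStat(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Keeps the n terms with the highest docFreq, then sorts them by total tf, descending. */
    static List<TermStat> topByDocFreqThenTotalTf(List<TermStat> all, int n) {
        List<TermStat> byDf = new ArrayList<>(all);
        byDf.sort(Comparator.comparingInt((TermStat t) -> t.docFreq).reversed());
        List<TermStat> top = new ArrayList<>(byDf.subList(0, Math.min(n, byDf.size())));
        top.sort(Comparator.comparingLong((TermStat t) -> t.totalTermFreq).reversed());
        return top;
    }

    public static void main(String[] args) {
        List<TermStat> stats = new ArrayList<>();
        stats.add(new TermStat("the", 100, 5000L));   // toy numbers, not real index data
        stats.add(new TermStat("of", 90, 6000L));
        stats.add(new TermStat("lucene", 10, 9000L));
        stats.add(new TermStat("a", 80, 1000L));
        for (TermStat t : topByDocFreqThenTotalTf(stats, 3)) {
            System.out.println(t.term + " df=" + t.docFreq + " totalTF=" + t.totalTermFreq);
        }
    }
}
```

In the toy data, "lucene" has the highest total term frequency but is cut in the first stage because its docFreq is low. That is one way to read "approximates" in the message: the result is exact only among the terms that survive the docFreq cut.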
[jira] Issue Comment Edited: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856965#action_12856965 ]

Tom Burton-West edited comment on LUCENE-2393 at 4/14/10 1:26 PM:

Patch against recent trunk. Can someone please suggest an appropriate existing unit test to use as a model for creating a unit test for this? Would it be appropriate to include a small index file for testing, or is it better to programmatically create the index file?

was (Author: tburtonwest):
Patch against recent trunk
[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856967#action_12856967 ]

Tom Burton-West commented on LUCENE-2393:

For an example of how this utility can be used please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated LUCENE-2393:

Attachment: LUCENE-2393.patch

Patch against recent trunk
[jira] Created: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
Utility to output total term frequency and df from a lucene index

Key: LUCENE-2393
URL: https://issues.apache.org/jira/browse/LUCENE-2393
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and consequent Disk I/O demands
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855022#action_12855022 ]

Tom Burton-West commented on LUCENE-1709:

Hi Robert,

I patched Revision 931708 and ran "ant clean test-contribute" and the tests ran just fine. The patch seems to have solved the problem.

Tom

> Parallelize Tests
>
> Key: LUCENE-1709
> URL: https://issues.apache.org/jira/browse/LUCENE-1709
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Assignee: Robert Muir
> Fix For: 3.1
> Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch,
> LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch,
> LUCENE-1709.patch, runLuceneTests.py
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> The Lucene tests can be parallelized to make for a faster testing system.
> This task from ANT can be used:
> http://ant.apache.org/manual/CoreTasks/parallel.html
> Previous discussion:
> http://www.gossamer-threads.com/lists/lucene/java-dev/69669
> Notes from Mike M.:
> {quote}
> I'd love to see a clean solution here (the tests are embarrassingly
> parallelizable, and we all have machines with good concurrency these
> days)... I have a rather hacked up solution now, that uses
> "-Dtestpackage=XXX" to split the tests up.
> Ideally I would be able to say "use N threads" and it'd do the right
> thing... like the -j flag to make.
> {quote}

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854960#action_12854960 ]

Tom Burton-West commented on LUCENE-1709:

This may or may not be a clue to the problem in benchmark. When I control-C'd the hung test, I got the error reported below.

Tom.

[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = C:\cygwin\home\tburtonw\lucene\april07_good\build\contrib\benchmark\test\W\one-line
[junit] ---
[junit] ---
[junit] java.io.FileNotFoundException: C:\cygwin\home\tburtonw\lucene\april07_good\contrib\benchmark\junitvmwatcher203463231158436475.properties (The process cannot access the file because it is being used by another process)
[junit] at java.io.FileInputStream.open(Native Method)
[junit] at java.io.FileInputStream.<init>(FileInputStream.java:106)
[junit] at java.io.FileReader.<init>(FileReader.java:55)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.executeAsForked(JUnitTask.java:1025)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:876)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:803)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:62)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:394)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.Parallel$TaskRunnable.run(Parallel.java:428)
[junit] at java.lang.Thread.run(Thread.java:619)
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854957#action_12854957 ]

Tom Burton-West commented on LUCENE-1709:

I am having the same issue Shai reported in LUCENE-2353: the parallel tests apparently cause the test run to hang on my Windows box, with both Revision 931573 and Revision 931304, when running the tests from root. Tests hang in WriteLineDocTaskTest, on this line:

[junit] > config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] ---

I ran the tests last night with Revision 931708 and had no problem. I ran them again this morning and got the hanging behavior. The difference is that last night the only thing running on my computer besides the tests was a couple of ssh terminal windows. Today, when I ran the tests and got the hanging behavior, I had Firefox, Outlook, Exceed, and WordPad open. The tests take 98-99.9% of my CPU while hanging. I suspect there is some kind of resource issue when running the tests in parallel.

Tom Burton-West
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854566#action_12854566 ]

Tom Burton-West commented on LUCENE-2353:

Shai,

I am having the same issue with the test hanging on my Windows box with both Revision 931573 and Revision 931304 when running the tests from root. Tests hang in WriteLineDocTaskTest, on this line:

[junit] > config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] ---

Should I open a separate JIRA issue about the test?

Tom Burton-West

> Config incorrectly handles Windows absolute pathnames
>
> Key: LUCENE-2353
> URL: https://issues.apache.org/jira/browse/LUCENE-2353
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 3.1
> Attachments: LUCENE-2353.patch, LUCENE-2353.patch
>
> I have no idea how no one ran into this so far, but I tried to execute an
> .alg file which used ReutersContentSource and referenced both docs.dir and
> work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the
> run reported an error of missing content under benchmark\work\something.
> I've traced the problem back to Config, where get(String, String) includes
> the following code:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> ...
> {code}
> It detects ":" in the value and so it thinks it's a per-round property, thus
> stripping "d:" from the value ... fix is very simple:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> } else if (sval.indexOf(":\\") >= 0) {
>   // this previously messed up absolute path names on Windows. Assuming
>   // there is no real value that starts with \\
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> {code}
> I'll post a patch w/ the above fix + test shortly.
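The effect of the extra `:\` check quoted above can be tried in isolation. The sketch below is a simplified, hypothetical helper, not the actual Config.get(String, String): it only strips the part before the first colon and leaves out the per-round parsing that the original code elides with "...". The sample values are made up for illustration.

```java
class ConfigValueSketch {

    /**
     * Hypothetical, simplified version of the check discussed in the issue.
     * Returns the value unchanged when it has no colon or looks like a Windows
     * absolute path; otherwise drops the column name before the first colon.
     */
    static String stripColumnName(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval; // plain value, nothing to strip
        } else if (sval.indexOf(":\\") >= 0) {
            return sval; // e.g. d:\something, treated as a path, not a per-round value
        }
        int k = sval.indexOf(":");
        return sval.substring(k + 1); // per-round value: keep what follows the name
    }

    public static void main(String[] args) {
        System.out.println(stripColumnName("d:\\something")); // path is kept whole
        System.out.println(stripColumnName("mrg:10:100"));    // name "mrg" is dropped
        System.out.println(stripColumnName("plain"));
    }
}
```

One consequence of testing for `:\` specifically is that a drive-letter path written with forward slashes (d:/something) would still be treated as a per-round value in this sketch.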
[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832689#action_12832689 ]

Tom Burton-West commented on LUCENE-2257:

Hi Michael,

Thanks for your help. We mounted one of the indexes with 2.4 billion terms on our dev server and tested with and without the patch. (I discovered that queries containing Korean characters would consistently trigger the bug.) With the patch, we don't see any ArrayIndexOutOfBounds exceptions. We are going to do a bit more testing before we put this into production. (We temporarily rolled back our production indexes to indexes that split the terms over 2 segments and therefore didn't trigger the bug.)

Other than walking through the code in the debugger, is there some systematic way of looking for any other places where an int is used that might also have problems when we have over 2.1x billion terms?

Tom

> relax the per-segment max unique term limit
>
> Key: LUCENE-2257
> URL: https://issues.apache.org/jira/browse/LUCENE-2257
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
> Attachments: LUCENE-2257.patch
>
> Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms
> in a single segment.
> But I think we can improve this to termIndexInterval (default 128) * 2.1B.
> There is one place (internal API only) where Lucene uses an int but should
> use a long.
[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
[ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832054#action_12832054 ]

Tom Burton-West commented on LUCENE-2257:

Thanks for the patch, Michael.

The patch worked fine with CheckIndex, which ran successfully on an index with 2.49 billion terms. I added commas to the output below:

test: terms, freq, prox...OK [2,487,224,745 terms; 23,573,976,855 terms/docs pairs; 97,223,318,067 tokens]

We are working on determining how to test it with Solr. The ArrayIndexOutOfBounds exception appears in the logs for about 1 in every 100 queries. We haven't been able to determine which queries trigger the problem.

We are using an older version of Solr with lucene 2.9-dev 779312 - 2009-05-27 17:19:55. I'm not sure if we can just drop in a later version of lucene with the patch or if I need to patch the older 2.9-dev lucene version that came with our Solr. What do you suggest?

What I'm thinking of is to run 10,000 queries against our dev server, pointing at one of the large segment indexes, with and without the patch.

Tom
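The arithmetic behind the limit can be checked with a few lines of plain Java. This is only a sketch of the numbers: the term count is the one in the CheckIndex output above, 128 is the default termIndexInterval named in the issue description, and the helper names are hypothetical. It makes no claim about how Lucene stores its term index internally.

```java
class TermLimitSketch {

    /** True when a term count no longer fits in a signed 32-bit int. */
    static boolean exceedsIntRange(long termCount) {
        return termCount > Integer.MAX_VALUE;
    }

    /** Number of entries if only every interval-th term is recorded (rounded up). */
    static long indexedTermCount(long termCount, int interval) {
        return (termCount + interval - 1) / interval;
    }

    public static void main(String[] args) {
        long termCount = 2_487_224_745L; // from the CheckIndex output above
        int interval = 128;              // default termIndexInterval per the issue

        System.out.println("fits in int: " + !exceedsIntRange(termCount));
        // A narrowing cast wraps around to a negative number, the kind of value
        // that could surface as an out-of-bounds array index.
        System.out.println("as int: " + (int) termCount);
        System.out.println("every 128th term: " + indexedTermCount(termCount, interval));
        System.out.println("relaxed limit: " + (long) interval * Integer.MAX_VALUE);
    }
}
```

Counting only every 128th term keeps the count well inside the int range for this index, which is consistent with the relaxed limit of termIndexInterval * 2.1B proposed in the issue.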