[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-16 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-2393:


Attachment: LUCENE-2393.patch

Updated HighFreqTermsWithTF to use the flex API.

I don't understand the flex API well enough yet to determine whether I should 
have used DocsEnum.read/DocsEnum.getBulkResult() to do a bulk read instead of 
DocsEnum.nextDoc() and DocsEnum.freq().

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-15 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-2393:


Attachment: LUCENE-2393.patch

New patch includes a (pre-flex) version of HighFreqTerms that finds the top N 
terms with the highest docFreq, looks up the total term frequency of each, and 
outputs the list of terms sorted by highest total term frequency (which 
approximates the largest entries in the *prx files). I'm not sure how to 
combine the GetTermInfo program with either version of HighFreqTerms in a way 
that leads to sane command-line arguments and argument processing. I suppose 
that HighFreqTerms could have a flag that turns the inclusion of total term 
frequency on or off.
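
The selection described above (top N by docFreq, then ordered by total term 
frequency) can be sketched without Lucene. This is not the code in the patch: 
TermStat, HighFreqSketch, and topByDocFreqSortedByTotalTf are hypothetical 
names, and the statistics are assumed to be already collected in memory, 
whereas the patch looks up the total term frequency only for the top N terms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: not Lucene's API and not the code in LUCENE-2393.patch.
final class HighFreqSketch {

    // Hypothetical holder for the statistics of one term.
    static final class TermStat {
        final String term;
        final int docFreq;        // number of documents containing the term
        final long totalTermFreq; // sum of tf over all documents

        TermStat(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Keeps the n terms with the highest docFreq using a min-heap,
    // then sorts those survivors by total term frequency, highest first.
    static List<TermStat> topByDocFreqSortedByTotalTf(List<TermStat> all, int n) {
        PriorityQueue<TermStat> heap =
                new PriorityQueue<>(Comparator.comparingInt((TermStat t) -> t.docFreq));
        for (TermStat t : all) {
            heap.add(t);
            if (heap.size() > n) {
                heap.poll(); // drop the term with the smallest docFreq
            }
        }
        List<TermStat> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong((TermStat t) -> t.totalTermFreq).reversed());
        return result;
    }
}
```

A term with a modest docFreq but a very large total term frequency is dropped 
in the first step, which is why the output only approximates the largest 
*prx entries.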

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands




[jira] Issue Comment Edited: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856965#action_12856965
 ] 

Tom Burton-West edited comment on LUCENE-2393 at 4/14/10 1:26 PM:
--

Patch against recent trunk. Can someone please suggest an appropriate 
existing unit test to use as a model for creating a unit test for this? Would 
it be appropriate to include a small index file for testing, or is it better to 
programmatically create the index file?

  was (Author: tburtonwest):
Patch against recent trunk
  
> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands




[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856967#action_12856967
 ] 

Tom Burton-West commented on LUCENE-2393:
-

For an example of how this utility can be used please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1





[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-2393:


Attachment: LUCENE-2393.patch

Patch against recent trunk





[jira] Created: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)
Utility to output total term frequency and df from a lucene index
-

 Key: LUCENE-2393
 URL: https://issues.apache.org/jira/browse/LUCENE-2393
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial


This is a command line utility that takes a field name, term, and index 
directory and outputs the document frequency for the term and the total number 
of occurrences of the term in the index (i.e. the sum of the tf of the term for 
each document).  It is useful for estimating the size of the term's entry in 
the *prx files and the consequent disk I/O demands.
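
The two numbers the utility reports can be illustrated without Lucene. The 
sketch below is not the patch: it models one term's postings as a hypothetical 
map from document id to within-document frequency, and TermStatsSketch, 
docFreq, and totalTermFreq are names made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: postings for one term are modeled as docId -> tf.
// This is not Lucene's API and not the code in LUCENE-2393.patch.
final class TermStatsSketch {

    // Document frequency: the number of documents that contain the term.
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    // Total term frequency: the sum of the per-document tf values.
    // A long is used because the sum can exceed the int range on large indexes.
    static long totalTermFreq(Map<Integer, Integer> postings) {
        long total = 0;
        for (int tf : postings.values()) {
            total += tf;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(3, 2);  // the term occurs twice in doc 3
        postings.put(7, 5);  // five times in doc 7
        postings.put(12, 1); // once in doc 12
        System.out.println("df=" + docFreq(postings)
                + " totalTf=" + totalTermFreq(postings));
    }
}
```

The total term frequency, not the document frequency, is the number that 
tracks how many positions have to be stored for the term, which is why it is 
the better proxy for the size of the term's *prx entry.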




[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855022#action_12855022
 ] 

Tom Burton-West commented on LUCENE-1709:
-

Hi Robert,

I patched Revision 931708 and ran "ant clean test-contrib" and the tests ran 
just fine.  The patch seems to have solved the problem.

Tom

> Parallelize Tests
> -
>
> Key: LUCENE-1709
> URL: https://issues.apache.org/jira/browse/LUCENE-1709
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
> LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
> LUCENE-1709.patch, runLuceneTests.py
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The Lucene tests can be parallelized to make for a faster testing system.  
> This task from ANT can be used: 
> http://ant.apache.org/manual/CoreTasks/parallel.html
> Previous discussion: 
> http://www.gossamer-threads.com/lists/lucene/java-dev/69669
> Notes from Mike M.:
> {quote}
> I'd love to see a clean solution here (the tests are embarrassingly
> parallelizable, and we all have machines with good concurrency these
> days)... I have a rather hacked up solution now, that uses
> "-Dtestpackage=XXX" to split the tests up.
> Ideally I would be able to say "use N threads" and it'd do the right
> thing... like the -j flag to make.
> {quote}
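
The "use N threads" idea in the quoted note can be sketched in plain Java. 
This is not what the LUCENE-1709 patches do (they work at the Ant level); 
ParallelRunSketch and runAll are hypothetical names, and each "test" is 
assumed to be an independent Callable that returns true on success.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a generic N-thread runner, not the Ant-based approach
// used by the LUCENE-1709 patches.
final class ParallelRunSketch {

    // Runs every task on a fixed pool of nThreads threads and returns
    // how many of them reported success.
    static int runAll(List<Callable<Boolean>> tasks, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int passed = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Boolean> f : pool.invokeAll(tasks)) {
                try {
                    if (f.get()) {
                        passed++;
                    }
                } catch (ExecutionException e) {
                    // A task that throws is counted as a failure.
                }
            }
            return passed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only works cleanly when the tasks share no mutable state such as 
temporary files or working directories.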

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854960#action_12854960
 ] 

Tom Burton-West commented on LUCENE-1709:
-

This may or may not be a clue to the problem in benchmark.  When I control-C'd 
the hung test, I got the error reported below.
Tom.


[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = C:\cygwin\home\tburtonw\lucene\april07_good\build\contrib\benchmark\test\W\one-line
[junit] ---
[junit] -  ---
[junit] java.io.FileNotFoundException: C:\cygwin\home\tburtonw\lucene\april07_good\contrib\benchmark\junitvmwatcher203463231158436475.properties (The process cannot access the file because it is being used by another process)
[junit] at java.io.FileInputStream.open(Native Method)
[junit] at java.io.FileInputStream.<init>(FileInputStream.java:106)
[junit] at java.io.FileReader.<init>(FileReader.java:55)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.executeAsForked(JUnitTask.java:1025)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:876)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:803)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:62)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:394)
[junit] at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at org.apache.tools.ant.taskdefs.Parallel$TaskRunnable.run(Parallel.java:428)
[junit] at java.lang.Thread.run(Thread.java:619)






[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854957#action_12854957
 ] 

Tom Burton-West commented on LUCENE-1709:
-

I am having the same issue Shai reported in LUCENE-2353: running the tests in 
parallel apparently causes them to hang on my Windows box with both Revision 
931573 and Revision 931304 when running the tests from root.

Tests hang in WriteLineDocTaskTest, on this line:
[junit] > config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] ---


I ran the tests last night with Revision 931708 and had no problem. I ran 
them again this morning and got the hanging behavior. The difference is that 
last night the only things running on my computer besides the tests were a 
couple of ssh terminal windows. Today, when I ran the tests and got the 
hanging behavior, I had Firefox, Outlook, Exceed, and WordPad open. The tests 
take 98-99.9% of my CPU while hanging. I suspect there is some kind of 
resource issue when running the tests in parallel.

Tom Burton-West





[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-04-07 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854566#action_12854566
 ] 

Tom Burton-West commented on LUCENE-2353:
-

Shai,

I am having the same issue with the test hanging on my Windows box with both 
Revision 931573 and Revision 931304 when running the tests from root.

Tests hang in WriteLineDocTaskTest, on this line:
[junit] > config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] ---

Should I open a separate JIRA issue about the test?

Tom Burton-West

> Config incorrectly handles Windows absolute pathnames
> -
>
> Key: LUCENE-2353
> URL: https://issues.apache.org/jira/browse/LUCENE-2353
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Shai Erera
> Fix For: 3.1
>
> Attachments: LUCENE-2353.patch, LUCENE-2353.patch
>
>
> I have no idea how no one ran into this so far, but I tried to execute an 
> .alg file which used ReutersContentSource and referenced both docs.dir and 
> work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
> run reported an error of missing content under benchmark\work\something.
> I've traced the problem back to Config, where get(String, String) includes 
> the following code:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> ...
> {code}
> It detects ":" in the value and so it thinks it's a per-round property, thus 
> stripping "d:" from the value ... fix is very simple:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> } else if (sval.indexOf(":\\") >= 0) {
>   // this previously messed up absolute path names on Windows. Assuming
>   // there is no real value that starts with \\
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> {code}
> I'll post a patch w/ the above fix + test shortly.
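
The check in the quoted fix can be isolated as a small predicate. The sketch 
below only mirrors the two indexOf tests shown above; it is not the committed 
Config code, and RoundPropertySketch and isPerRoundValue are hypothetical 
names. The sample value "buf:10:100" is an assumed example of a per-round 
value (a column name followed by one value per round).

```java
// Illustrative only: mirrors the two indexOf checks quoted above,
// not the actual Config class in contrib/benchmark.
final class RoundPropertySketch {

    // Returns true when the value should be treated as a per-round property
    // ("name:v1:v2..."), false for plain values and for Windows absolute
    // paths such as d:\something.
    static boolean isPerRoundValue(String sval) {
        if (sval.indexOf(":") < 0) {
            return false; // no colon at all: plain value
        }
        if (sval.indexOf(":\\") >= 0) {
            return false; // colon followed by a backslash: Windows path
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPerRoundValue("d:\\something")); // Windows path
        System.out.println(isPerRoundValue("buf:10:100"));    // per-round value
        System.out.println(isPerRoundValue("RAMDirectory"));  // plain value
    }
}
```

Note that a Windows path written with forward slashes (d:/something) would 
still be treated as a per-round value by this check.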




[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-11 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832689#action_12832689
 ] 

Tom Burton-West commented on LUCENE-2257:
-

Hi Michael,

Thanks for your help. We mounted one of the indexes with 2.4 billion terms on 
our dev server and tested with and without the patch. (I discovered that 
queries containing Korean characters would consistently trigger the bug.) 
With the patch, we don't see any ArrayIndexOutOfBounds exceptions. We are 
going to do a bit more testing before we put this into production. (We rolled 
back our production indexes temporarily to indexes that split the terms over 2 
segments and therefore didn't trigger the bug.)

Other than walking through the code in the debugger, is there some systematic 
way of looking for any other places where an int is used that might also have 
problems when we have over 2.1x billion terms?

Tom

> relax the per-segment max unique term limit
> ---
>
> Key: LUCENE-2257
> URL: https://issues.apache.org/jira/browse/LUCENE-2257
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2257.patch
>
>
> Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms 
> in a single segment.
> But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
> There is one place (internal API only) where Lucene uses an int but should 
> use a long.
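
The kind of int-versus-long problem described above can be shown in a few 
lines. The actual expression in Lucene is not shown in this thread, so the 
sketch uses a hypothetical calculation (an index entry number multiplied by 
the term index interval); TermIndexOverflowSketch and its methods are made-up 
names.

```java
// Illustrative only: a hypothetical position calculation, not Lucene's code.
final class TermIndexOverflowSketch {

    // 32-bit multiplication: silently wraps around past 2,147,483,647.
    static int positionAsInt(int indexEntry, int interval) {
        return indexEntry * interval;
    }

    // Widening one operand to long first keeps the full result.
    static long positionAsLong(int indexEntry, int interval) {
        return (long) indexEntry * interval;
    }

    // Math.multiplyExact throws ArithmeticException instead of wrapping,
    // which makes overflow visible during testing.
    static boolean overflowsInt(int indexEntry, int interval) {
        try {
            Math.multiplyExact(indexEntry, interval);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int indexEntry = 20_000_000; // hypothetical index entry number
        int interval = 128;          // default termIndexInterval per the issue
        System.out.println(positionAsInt(indexEntry, interval));  // negative: wrapped
        System.out.println(positionAsLong(indexEntry, interval)); // 2560000000
    }
}
```

A wrapped, negative value used as an array index is one way to end up with an 
ArrayIndexOutOfBounds exception. Checked arithmetic such as 
Math.multiplyExact (Java 8 and later) is one option for flushing out similar 
spots in tests.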




[jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-10 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832054#action_12832054
 ] 

Tom Burton-West commented on LUCENE-2257:
-

Thanks for the patch Michael,

The patch worked fine with CheckIndex.  CheckIndex worked with an index with 
2.49 billion terms.
I added commas to the output below:
 test: terms, freq, prox...OK [2,487,224,745 terms; 23,573,976,855 terms/docs 
pairs; 97,223,318,067 tokens]

We are working on determining how to test it with Solr.  The 
ArrayIndexOutOfBounds exception appears in the logs for about 1 in every 
100 queries.  We haven't been able to determine which queries trigger the 
problem.

We are using an older version of Solr with Lucene 2.9-dev 779312 - 2009-05-27 
17:19:55.  I'm not sure if we can just drop in a later version of Lucene with 
the patch or if I need to patch the older 2.9-dev Lucene version that came with 
our Solr.  What do you suggest?

What I'm thinking of is to run 10,000 queries against our dev server pointing 
at one of the large segment indexes, with and without the patch.

Tom




