[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-27 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055437#comment-13055437
 ] 

Koji Sekiguchi commented on SOLR-2583:
--

I'd like the feature as I'm using ExternalFileField a lot!

bq. what do you say regarding the suggestion to use HashMap up to ~5.5% and 
above that using the float[]?

Looking at your test, I think it is reasonable. But I'd like to use 
CompactByteArray: in my test it beat both HashMap and float[] at 5% and 
above.

How about introducing compact=yes (the default would be no, which uses float[]) 
together with sparse=yes/no/auto?

 Make external scoring more efficient (ExternalFileField, FileFloatSource)
 -

 Key: SOLR-2583
 URL: https://issues.apache.org/jira/browse/SOLR-2583
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Martin Grotzke
Priority: Minor
 Attachments: FileFloatSource.java.patch, patch.txt


 External scoring uses a lot of memory, depending on the number of documents in 
 the index. ExternalFileField (used for external scoring) uses 
 FileFloatSource, and one FileFloatSource is created per external scoring 
 file. FileFloatSource allocates a float array sized to the number of 
 docs (even if the file to load is not found). If the scoring file has far 
 fewer entries than there are docs in total, the 
 big float array wastes a lot of memory.
 This could be optimized by using a map of doc -> score, so that the map 
 contains as many entries as there are scoring entries in the external file, 
 but not more.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-27 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055737#comment-13055737
 ] 

Martin Grotzke commented on SOLR-2583:
--

bq. Looking at your test, I think it is reasonable. But I'd like to use 
CompactByteArray. I saw it wins over HashMap and float[] when 5% and above in 
my test.

Can you share your test code or something similar? Perhaps you can just fork 
https://github.com/magro/lucene-solr/ and add an appropriate test that reflects 
your data?




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-27 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056224#comment-13056224
 ] 

Koji Sekiguchi commented on SOLR-2583:
--

I didn't save the test snippet because I wrote it while out of the office (on 
someone else's PC). All I did was use CompactByteArray instead of 
CompactFloatArray in your FileFloatSourceMemoryTest.java.





[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-16 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050435#comment-13050435
 ] 

Martin Grotzke commented on SOLR-2583:
--

bq. Are you sure real floats are actually needed?
In our case score values are e.g. 15887 (one example just taken from one of 
the files). With this sample this test fails:
{noformat}
byte small = SmallFloat.floatToByte315(104626500f);
assertEquals(104626500f, SmallFloat.byte315ToFloat(small), 0f);
-> AssertionError: expected:<1.04626496E8> but was:<1.00663296E8>
{noformat}

This shows that even our own data contains a case where this produces wrong 
results, and even if we could work around it in our case, someone else might 
hit the same issue.
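
The precision loss can be reproduced without Lucene on the classpath. The sketch below re-implements a byte encoding with 3 mantissa bits and a zero exponent of 15. The class name `SmallFloat315Sketch` is made up for illustration, and the bit layout is an assumption about how `SmallFloat` behaves, not a copy of it.

```java
// Illustrative stand-in for a "315" small-float encoding; not Lucene's class.
// ASSUMPTION: the bit layout below mirrors SmallFloat; verify against the real source.
final class SmallFloat315Sketch {
    // Offset between the float exponent range and the byte's exponent range.
    private static final int FZERO = (63 - 15) << 3;

    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3); // drop the low-order mantissa bits
        if (small <= FZERO) {
            // zero and negatives map to 0, positive underflow to the smallest value
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return -1; // overflow maps to the largest representable value
        }
        return (byte) (small - FZERO);
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float score = 104626500f;
        float roundTripped = byte315ToFloat(floatToByte315(score));
        // The dropped mantissa bits are lost, so the value comes back smaller.
        System.out.println(score + " -> " + roundTripped);
    }
}
```

Under these assumptions the score from the failing test comes back as 100663296, which matches the value in the assertion error above: only the leading bits of the mantissa survive the round trip.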


bq. it would also be good to measure performance...
I wouldn't expect the boxing to make a real difference here, especially in 
relation to the rest of the time spent during a search request.
A meaningful time-based performance comparison would take some time: it would 
have to be put in relation to the rest of a search request (how do you do 
this?), and the results would then need proper interpretation. Right now I 
don't think it's worth the effort.


{quote}
bq. that uses a fixed size and an increasing number of puts
I'm not certain how realistic that is, remember behind the scenes 
compactbytearray uses blocks,
and if you touch every one (by putting every K docid or something) then you are 
just testing
the worst case.
{quote}
Do you want to change the test to something that's more realistic?


@Yonik: what do you say regarding the suggestion to use HashMap up to ~5.5% and 
above that using the float[]?




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-15 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049674#comment-13049674
 ] 

Martin Grotzke commented on SOLR-2583:
--

The test that produced this output can be found in my lucene-solr fork on 
github: https://github.com/magro/lucene-solr/commit/b9af87b1
The test method executed was testCompareMemoryUsage. To measure memory usage I 
used http://code.google.com/p/memory-measurer/ and ran the test JVM with 
-Xmx1G -javaagent:solr/lib/object-explorer.jar (straight from Eclipse).

I just added another test, that uses a fixed size and an increasing number of 
puts (testCompareMemoryUsageWithFixSizeAndIncreasingNumPuts, 
https://github.com/magro/lucene-solr/blob/trunk/solr/src/test/org/apache/solr/search/function/FileFloatSourceMemoryTest.java#L56),
 with the following results:

{noformat}
Size: 1.000.000
NumPuts 1.000 (0,1%),        CompactFloatArray 918.616,    float[] 4.000.016,  HashMap 72.128
NumPuts 10.000 (1,0%),       CompactFloatArray 3.738.712,  float[] 4.000.016,  HashMap 701.696
NumPuts 50.000 (5,0%),       CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 3.383.104
NumPuts 55.000 (5,5%),       CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 3.949.120
NumPuts 60.000 (6,0%),       CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 4.254.848
NumPuts 100.000 (10,0%),     CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 6.622.272
NumPuts 500.000 (50,0%),     CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 27.262.976
NumPuts 1.000.000 (100,0%),  CompactFloatArray 4.016.472,  float[] 4.000.016,  HashMap 44.649.664
{noformat}

It seems that the HashMap is the most efficient solution up to ~5.5%. Starting 
from this threshold CompactFloatArray and float[] use less memory, while the 
CompactFloatArray has no advantage over float[] for puts > 5%.

Therefore I'd suggest an adaptive strategy that uses a HashMap as long as the 
number of scores is below ~5.5% of numDocs, and the original float[] approach 
from this threshold on.

What do you say?
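
As a rough cross-check of the ~5.5% figure, the break-even can be estimated from two of the measured rows. This is a back-of-the-envelope sketch only: the class and method names are made up, and it assumes HashMap memory grows linearly with the number of entries, which the measurements show is only approximately true because the backing table resizes in steps.

```java
// Hypothetical helper that estimates where a HashMap starts costing more than float[].
final class BreakEvenSketch {
    /** Bytes per HashMap entry implied by one measured row. */
    static double bytesPerEntry(long hashMapBytes, long numPuts) {
        return (double) hashMapBytes / numPuts;
    }

    /** Fraction of docs at which the map would cost as much as the float[]. */
    static double breakEvenFraction(long floatArrayBytes, double bytesPerEntry, long numDocs) {
        return floatArrayBytes / bytesPerEntry / numDocs;
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L;
        long floatArrayBytes = 4_000_016L;                    // float[] column, constant
        double perEntry = bytesPerEntry(3_949_120L, 55_000L); // HashMap row at 5,5%
        double fraction = breakEvenFraction(floatArrayBytes, perEntry, numDocs);
        System.out.printf("~%.1f bytes/entry, break-even near %.2f%% of docs%n",
                perEntry, fraction * 100);
    }
}
```

With the row at 5,5% this works out to roughly 72 bytes per entry and a break-even slightly above 5.5% of the docs, in line with the table.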




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049706#comment-13049706
 ] 

Robert Muir commented on SOLR-2583:
---

Are you sure real floats are actually needed?
Why not use compactbytearray with smallfloat encoding?

It would also be good to measure performance... doesn't a hashmap have to box 
*per-docid* into an Integer for lookup?
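
For context on the boxing concern: looking up an `int` docid in a `HashMap<Integer, Float>` autoboxes the key on every call. One way to avoid that, sketched here as an assumption and not as anything contained in the attached patch, is an open-addressing map over primitive arrays. The class name `IntFloatMapSketch` and its fixed capacity are illustrative choices.

```java
import java.util.Arrays;

// Hypothetical primitive int -> float map with linear probing; not part of the patch.
final class IntFloatMapSketch {
    private static final int EMPTY = -1; // docids are non-negative, so -1 marks a free slot
    private final int[] keys;
    private final float[] values;
    private final int mask;
    private int size;

    IntFloatMapSketch(int expectedEntries) {
        // Power-of-two capacity of at least 2x the expected entries keeps probe chains short.
        int capacity = Integer.highestOneBit(Math.max(2, expectedEntries) * 2 - 1) << 1;
        keys = new int[capacity];
        values = new float[capacity];
        mask = capacity - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int docId, float score) {
        int slot = mix(docId) & mask;
        while (keys[slot] != EMPTY && keys[slot] != docId) {
            slot = (slot + 1) & mask; // linear probing
        }
        if (keys[slot] == EMPTY) {
            // Always keep one free slot so that probing terminates.
            if (size + 1 == keys.length) {
                throw new IllegalStateException("map is full");
            }
            size++;
            keys[slot] = docId;
        }
        values[slot] = score;
    }

    float get(int docId, float defaultScore) {
        int slot = mix(docId) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == docId) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return defaultScore;
    }

    // Cheap bit mixer so that sequential docids spread over the table.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }
}
```

Whether this is measurably faster than a boxed HashMap inside a real search request would still have to be benchmarked; the sketch only shows that the lookup itself can be done without allocating.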






[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049709#comment-13049709
 ] 

Robert Muir commented on SOLR-2583:
---

bq. that uses a fixed size and an increasing number of puts

I'm not certain how realistic that is, remember behind the scenes 
compactbytearray uses blocks,
and if you touch every one (by putting every K docid or something) then you are 
just testing 
the worst case.
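
The effect of block granularity can be illustrated by counting how many blocks a set of puts touches. The sketch below is hypothetical: the class name is made up, and the block size of 128 docids is an assumption for illustration, not a value taken from the patch.

```java
import java.util.BitSet;

// Hypothetical helper: counts the blocks that contain at least one put.
final class TouchedBlocksSketch {
    static int countTouchedBlocks(int[] docIds, int blockSize) {
        BitSet touched = new BitSet();
        for (int docId : docIds) {
            touched.set(docId / blockSize);
        }
        return touched.cardinality();
    }

    /** Returns count docids: 0, k, 2k, ... */
    static int[] everyKth(int count, int k) {
        int[] ids = new int[count];
        for (int i = 0; i < count; i++) {
            ids[i] = i * k;
        }
        return ids;
    }

    public static void main(String[] args) {
        int blockSize = 128; // ASSUMPTION, for illustration only
        // 10,000 puts spread over 1,000,000 docs vs. 10,000 puts on adjacent docids.
        int spread = countTouchedBlocks(everyKth(10_000, 100), blockSize);
        int clustered = countTouchedBlocks(everyKth(10_000, 1), blockSize);
        System.out.println("spread: " + spread + " blocks, clustered: " + clustered + " blocks");
    }
}
```

With the same number of puts, evenly spread docids touch nearly every block while adjacent docids touch only a handful, so the distribution of docids in the test matters as much as their count.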





[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-14 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049143#comment-13049143
 ] 

Martin Grotzke commented on SOLR-2583:
--

{quote}
See: http://www.strchr.com/multi-stage_tables

i attached a patch, of a (not great) implementation i was sorta kinda trying to 
clean up for other reasons... maybe you can use it.
{quote}

Thanx, interesting approach!

I just tried to create a CompactFloatArray based on the CompactByteArray to be 
able to compare memory consumption. There's one change that wasn't just 
changing byte to float, and I'm not sure what the right adaptation is in this case:

{code}
diff -w solr/src/java/org/apache/solr/util/CompactByteArray.java solr/src/java/org/apache/solr/util/CompactFloatArray.java
57c57
...
202,203c202,203
<   private void touchBlock(int i, int value) {
<     hashes[i] = (hashes[i] + (value << 1)) | 1;
---
>   private void touchBlock(int i, float value) {
>     hashes[i] = (hashes[i] + (Float.floatToIntBits(value) << 1)) | 1;
{code}

The adapted test is green, so it seems to be correct at least. I'll also attach 
the full patch for CompactFloatArray.java and TestCompactFloatArray.java




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-14 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049256#comment-13049256
 ] 

Martin Grotzke commented on SOLR-2583:
--

I just compared memory consumption of the 3 different approaches, with 
different number of puts (number of scores) and sizes (number of docs):

{noformat}
Puts 1.000,     size 1.000.000:   CompactFloatArray 898.136,     float[] 4.000.016,   HashMap 72.192
Puts 10.000,    size 1.000.000:   CompactFloatArray 3.724.376,   float[] 4.000.016,   HashMap 702.784
Puts 100.000,   size 1.000.000:   CompactFloatArray 4.016.472,   float[] 4.000.016,   HashMap 6.607.808
Puts 1.000.000, size 1.000.000:   CompactFloatArray 4.016.472,   float[] 4.000.016,   HashMap 44.644.032
Puts 1.000,     size 5.000.000:   CompactFloatArray 1.128.536,   float[] 20.000.016,  HashMap 72.256
Puts 10.000,    size 5.000.000:   CompactFloatArray 8.168.536,   float[] 20.000.016,  HashMap 704.832
Puts 100.000,   size 5.000.000:   CompactFloatArray 20.013.144,  float[] 20.000.016,  HashMap 7.385.152
Puts 1.000.000, size 5.000.000:   CompactFloatArray 20.131.160,  float[] 20.000.016,  HashMap 66.395.584
Puts 1.000,     size 10.000.000:  CompactFloatArray 1.275.992,   float[] 40.000.016,  HashMap 72.256
Puts 10.000,    size 10.000.000:  CompactFloatArray 9.289.816,   float[] 40.000.016,  HashMap 705.280
Puts 100.000,   size 10.000.000:  CompactFloatArray 37.130.328,  float[] 40.000.016,  HashMap 7.418.112
Puts 1.000.000, size 10.000.000:  CompactFloatArray 40.262.232,  float[] 40.000.016,  HashMap 69.282.496
{noformat}

I'm sharing this as an interim result, without further interpretation or 
conclusions for now (I need to catch my train).




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046564#comment-13046564
 ] 

Yonik Seeley commented on SOLR-2583:


Yeah, this will help for sparse fields, but hurt quite a bit for non-sparse 
ones.
Seems like we should make it an option (sparse=true/false on the fieldType 
definition)?




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046674#comment-13046674
 ] 

Martin Grotzke commented on SOLR-2583:
--

Yes, you're right regarding non-sparse fields. The question for the user will 
be when to set sparse to true or false. Files may also differ, in that some 
are big and others are small. So I'm thinking about making it 
adaptive: when the number of lines reaches a certain percentage of the 
number of docs, the float array is used, otherwise the doc->score map is used. 
Perhaps it would be good to allow the user to override this, something like 
sparse=yes/no/auto.

What do you think?
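
The yes/no/auto idea could look roughly like the following. This is a sketch under stated assumptions: the enum, the method names, the choice of auto as the fallback when nothing is configured, and the threshold parameter are all hypothetical, since none of them has been decided in this issue.

```java
import java.util.Locale;

// Hypothetical sketch of resolving a sparse=yes/no/auto setting; names are illustrative.
final class SparseModeSketch {
    enum SparseMode { YES, NO, AUTO }

    /** Parses the configured value; falling back to AUTO is an assumption. */
    static SparseMode parse(String value) {
        if (value == null) {
            return SparseMode.AUTO;
        }
        return SparseMode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /**
     * Decides whether the doc->score map should be used instead of the float array.
     * autoThreshold is the fraction of docs (e.g. 0.1 for 10%) below which
     * the map is chosen in AUTO mode.
     */
    static boolean useMap(SparseMode mode, int numEntries, int numDocs, double autoThreshold) {
        switch (mode) {
            case YES:
                return true;
            case NO:
                return false;
            default:
                return numEntries < autoThreshold * numDocs;
        }
    }
}
```

For example, `useMap(parse("auto"), 1000, 1000000, 0.1)` would pick the map, while the same call with 500000 entries would fall back to the float array.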




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046675#comment-13046675
 ] 

Robert Muir commented on SOLR-2583:
---

a smallfloat option could help too? (1/4 the ram)




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046688#comment-13046688
 ] 

Yonik Seeley commented on SOLR-2583:


bq. Perhaps it would be good to allow the user to override this, s.th. like 
sparse=yes/no/auto.

Sounds good!  I wonder what the memory cut-off should be for auto... 10% of 
maxDoc() or so?

bq. a smallfloat option could help too? (1/4 the ram)

Yep!




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046692#comment-13046692
 ] 

Martin Grotzke commented on SOLR-2583:
--

Great, that sounds like a further optimization for both sparse and non-sparse 
files. Though, as we had 4GB taken by FileFloatSource objects, even a reduction 
to 1/4 would still leave too much for us, so for our case I prefer the 
map-based approach, then combined with SmallFloat.




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046712#comment-13046712
 ] 

Martin Grotzke commented on SOLR-2583:
--

bq. Sounds good!  I wonder what the memory cut-off should be for auto... 10% of 
maxDoc() or so?

I'd compare both strategies to find the break-even point; this should give an 
absolute number.




[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046785#comment-13046785
 ] 

Robert Muir commented on SOLR-2583:
---

bq. Though, as we had 4GB taken by FileFloatSource objects a reduction to 1/4 
would still be too much for us so for our case I prefer the map based approach 
- then with Smallfloat.

If the problem is sparsity, maybe use a two-stage table, still faster than a 
hashmap and much better for the worst case.
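
A minimal illustration of the two-stage idea, assuming a fixed block size and using `null` for blocks that were never written: the first stage maps a block index to a block, the second stage indexes into the block. This is a simplified variant written for this thread, not the CompactByteArray implementation; the class name and the block size of 128 are assumptions.

```java
import java.util.Arrays;

// Hypothetical two-stage float table; untouched blocks cost only one null reference.
final class TwoStageFloatTableSketch {
    private static final int BLOCK_SHIFT = 7;                // 128 docids per block (assumed)
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final float[][] blocks;   // stage 1: block index -> block, null means "all default"
    private final float defaultValue;

    TwoStageFloatTableSketch(int numDocs, float defaultValue) {
        this.blocks = new float[(numDocs + BLOCK_SIZE - 1) >>> BLOCK_SHIFT][];
        this.defaultValue = defaultValue;
    }

    void put(int docId, float score) {
        int b = docId >>> BLOCK_SHIFT;
        if (blocks[b] == null) {
            // Allocate a block lazily, the first time one of its docids gets a score.
            blocks[b] = new float[BLOCK_SIZE];
            Arrays.fill(blocks[b], defaultValue);
        }
        blocks[b][docId & BLOCK_MASK] = score; // stage 2: offset within the block
    }

    float get(int docId) {
        float[] block = blocks[docId >>> BLOCK_SHIFT];
        return block == null ? defaultValue : block[docId & BLOCK_MASK];
    }

    int allocatedBlocks() {
        int n = 0;
        for (float[] block : blocks) {
            if (block != null) {
                n++;
            }
        }
        return n;
    }
}
```

Lookups are two array reads with no hashing or boxing, and in the worst case (every block touched) the memory cost is close to that of a plain float[] plus the first-stage array.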





[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

2011-06-09 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046943#comment-13046943
 ] 

Martin Grotzke commented on SOLR-2583:
--

bq. If the problem is sparsity, maybe use a two-stage table, still faster than 
a hashmap and much better for the worst case.

What do you mean by a two-stage table? Can you clarify this, please?
