[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2012-03-05 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222914#comment-13222914
 ] 

Robert Muir commented on LUCENE-3003:
-

3.6 pruning: can we push this out to 4.0 (mark resolved?)

 Move UnInvertedField into Lucene core
 -

 Key: LUCENE-3003
 URL: https://issues.apache.org/jira/browse/LUCENE-3003
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3003.patch, LUCENE-3003.patch, 
 byte_size_32-bit-openjdk6.txt


 Solr's UnInvertedField lets you quickly look up all term ords for a
 given doc/field.
 Like FieldCache, it inverts the index to produce this, and creates a
 RAM-resident data structure holding the bits; but, unlike FieldCache,
 it can handle multiple values per doc, and it does not hold the term
 bytes in RAM.  Rather, it holds only term ords, and then uses
 TermsEnum to resolve ord -> term.
 This is great eg for faceting, where you want to use int ords for all
 of your counting, and then only at the end you need to resolve the
 top N ords to their text.
 I think this is a useful core functionality, and we should move most
 of it into Lucene's core.  It's a good complement to FieldCache.  For
 this first baby step, I just move it into core and refactor Solr's
 usage of it.
 After this, as separate issues, I think there are some things we could
 explore/improve:
   * The first-pass that allocates lots of tiny byte[] looks like it
 could be inefficient.  Maybe we could use the byte slices from the
 indexer for this...
   * We can improve the RAM efficiency of the TermIndex: if the codec
 supports ords, and we are operating on one segment, we should just
 use it.  If not, we can use a more RAM-efficient data structure,
 eg an FST mapping to the ord.
   * We may be able to improve on the main byte[] representation by
 using packed ints instead of delta-vInt?
   * Eventually we should fold this ability into docvalues, ie we'd
 write the byte[] image at indexing time, and then loading would be
 fast, instead of uninverting
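
 As an illustration of the faceting pattern described above (count by
 int ord, resolve only the top N ords to text), here is a minimal
 sketch.  It does not use any Lucene API: the class OrdFacetSketch, the
 int[][] doc-to-ords table, and the String[] standing in for the
 TermsEnum ord -> term lookup are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: count facets by int ord, resolve text only for the top N. */
final class OrdFacetSketch {

    /** Returns the terms of the topN most frequent ords, most frequent first. */
    static List<String> topTerms(int[][] docToOrds, String[] ordToTerm, int topN) {
        int[] counts = new int[ordToTerm.length];
        for (int[] ords : docToOrds) {          // one int[] of term ords per document
            for (int ord : ords) {
                counts[ord]++;                  // counting touches ints only, never term bytes
            }
        }
        List<Integer> seen = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                seen.add(ord);
            }
        }
        // Highest count first; ties broken by lower ord so the result is deterministic.
        seen.sort(Comparator.<Integer>comparingInt(o -> -counts[o]).thenComparingInt(o -> o));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, seen.size()); i++) {
            result.add(ordToTerm[seen.get(i)]); // ord -> term only for the winners
        }
        return result;
    }
}
```

 A real implementation would likely use a bounded priority queue rather
 than sorting every seen ord; the sort just keeps the sketch short.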

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-04-01 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014703#comment-13014703
 ] 

Yonik Seeley commented on LUCENE-3003:
--

bq. Attached: 32-bit results

Ah, bummer.  It's every 8 bytes, but with a 4 byte offset!
I guess we could make it based on if we detect 32 vs 64 bit jvm... but maybe 
first see if anyone has any ideas about how to use something like pagedbytes 
instead.




[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-04-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014723#comment-13014723
 ] 

Michael McCandless commented on LUCENE-3003:


bq. It is inefficient - but I never saw a way around it since the lists are all 
being built in parallel (due to the fact that we are uninverting).

Lucene's indexer (TermsHashPerField) has precisely this same problem
-- every unique term must point to two (well, one if omitTFAP)
growable byte arrays.  We use slices into a single big (paged)
byte[], where first slice is tiny and can only hold like 5 bytes, but
then points to the next slice which is a bit bigger, etc.
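
To make the slice idea concrete, here is a simplified, hypothetical
sketch.  It is not the TermsHashPerField code: the class
SlicePoolSketch, the level sizes, and the layout (4 reserved bytes at
the end of each slice for a forward pointer, one growable byte[]
instead of pages) are assumptions chosen for brevity.  It only shows
how many lists can grow in interleaved order inside one shared array.

```java
import java.util.Arrays;

/** Hypothetical, simplified sketch of growable byte lists stored as chained slices in one byte[]. */
final class SlicePoolSketch {
    // Data bytes per slice level; every slice also reserves 4 trailing bytes for a forward pointer.
    private static final int[] LEVEL_DATA = {5, 14, 30, 62};
    private static final int PTR_BYTES = 4;

    private byte[] pool = new byte[64];
    private int used = 0;

    /** Write-side state for one list; many of these can be appended to in interleaved order. */
    static final class SliceList {
        int start;   // offset of the first slice
        int pos;     // next write offset
        int end;     // end of the data area of the current slice
        int level;   // index into LEVEL_DATA for the current slice
        int length;  // total data bytes written
    }

    private int allocSlice(int level) {
        int size = LEVEL_DATA[level] + PTR_BYTES;
        if (used + size > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + size));
        }
        int offset = used;
        used += size;
        return offset;
    }

    SliceList newList() {
        SliceList list = new SliceList();
        list.start = list.pos = allocSlice(0);
        list.end = list.pos + LEVEL_DATA[0];
        return list;
    }

    void append(SliceList list, byte b) {
        if (list.pos == list.end) {
            int nextLevel = Math.min(list.level + 1, LEVEL_DATA.length - 1);
            int next = allocSlice(nextLevel);
            // Record where the chain continues in the 4 reserved bytes of the full slice.
            pool[list.end] = (byte) (next >>> 24);
            pool[list.end + 1] = (byte) (next >>> 16);
            pool[list.end + 2] = (byte) (next >>> 8);
            pool[list.end + 3] = (byte) next;
            list.level = nextLevel;
            list.pos = next;
            list.end = next + LEVEL_DATA[nextLevel];
        }
        pool[list.pos++] = b;
        list.length++;
    }

    /** Follows the slice chain and copies the list's bytes into a fresh array. */
    byte[] toArray(SliceList list) {
        byte[] out = new byte[list.length];
        int offset = list.start;
        int level = 0;
        int copied = 0;
        while (copied < out.length) {
            int chunk = Math.min(LEVEL_DATA[level], out.length - copied);
            System.arraycopy(pool, offset, out, copied, chunk);
            copied += chunk;
            if (copied < out.length) {
                int ptr = offset + LEVEL_DATA[level];
                offset = ((pool[ptr] & 0xff) << 24) | ((pool[ptr + 1] & 0xff) << 16)
                        | ((pool[ptr + 2] & 0xff) << 8) | (pool[ptr + 3] & 0xff);
                level = Math.min(level + 1, LEVEL_DATA.length - 1);
            }
        }
        return out;
    }
}
```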

We could look @ refactoring that for this use too...

Though this is just the one-time startup cost.

bq. Another small & easy optimization I hadn't gotten around to yet was to 
lower the indexIntervalBits and make it configurable.

I did make it configurable to the Lucene class (you can pass it in to
ctor), but for Solr I left it using every 128th term.
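
For readers unfamiliar with the trade-off, a hypothetical sketch of a
sparse term index follows.  With indexIntervalBits = 7, every 128th
term is held in RAM; an ord lookup seeks to the nearest indexed term
and scans forward.  The class SparseTermIndexSketch and the sorted
String[] standing in for the on-disk term dictionary are assumptions,
not the actual UnInvertedField code.

```java
import java.util.Arrays;

/** Hypothetical sketch: keep every (1 << intervalBits)-th term in RAM, scan for the rest. */
final class SparseTermIndexSketch {
    private final String[] allTerms;   // stands in for the on-disk term dictionary (sorted, unique)
    private final String[] indexed;    // the sparse in-RAM index
    private final int intervalBits;

    SparseTermIndexSketch(String[] sortedUniqueTerms, int intervalBits) {
        this.allTerms = sortedUniqueTerms;
        this.intervalBits = intervalBits;
        int interval = 1 << intervalBits;
        this.indexed = new String[(sortedUniqueTerms.length + interval - 1) >>> intervalBits];
        for (int i = 0; i < indexed.length; i++) {
            indexed[i] = sortedUniqueTerms[i << intervalBits];
        }
    }

    /** Resolves ord -> term: seek to the closest indexed term at or before ord, then step forward. */
    String lookup(int ord) {
        String seekTo = indexed[ord >>> intervalBits];
        int pos = Arrays.binarySearch(allTerms, seekTo);   // stands in for a TermsEnum seek
        int steps = ord & ((1 << intervalBits) - 1);       // at most interval - 1 forward steps
        return allTerms[pos + steps];
    }

    int indexedTermCount() {
        return indexed.length;
    }
}
```

Lowering the interval bits costs more RAM for the index but shortens
the forward scan per lookup.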

{quote}
Another small optimization would be to store an array of offsets to 
length-prefixed byte arrays, rather than a BytesRef[]. At least the values are 
already in packed byte arrays via PagedBytes.
{quote}

Both FieldCache and docvalues (branch) store an array-of-terms like
this (the array of offsets is packed ints).
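
A minimal sketch of that array-of-terms layout, under stated
assumptions: the class TermBlobSketch is hypothetical, the length
prefix is a single byte (so terms are limited to 255 UTF-8 bytes), and
a plain int[] stands in for the packed-ints offsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: one byte[] of length-prefixed terms plus an offsets array. */
final class TermBlobSketch {
    private final byte[] blob;     // all terms back to back, each preceded by a 1-byte length
    private final int[] offsets;   // offsets[i] = start of entry i in blob (packed ints in practice)

    TermBlobSketch(String[] terms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            byte[] utf8 = terms[i].getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 255) {
                throw new IllegalArgumentException("term too long for a 1-byte length prefix");
            }
            offsets[i] = out.size();
            out.write(utf8.length);            // length prefix
            out.write(utf8, 0, utf8.length);   // term bytes
        }
        blob = out.toByteArray();
    }

    /** Resolves index -> term without allocating one object per term up front. */
    String get(int index) {
        int start = offsets[index];
        int len = blob[start] & 0xff;
        return new String(blob, start + 1, len, StandardCharsets.UTF_8);
    }
}
```

Compared with a BytesRef[], this avoids one object header and one
reference per term.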

We should also look at using an FST, which'd be the most compact but
the ord -> term lookup cost goes up.

Anyway I think we can pursue these cool ideas on new [future]
issues...





[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-31 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013869#comment-13013869
 ] 

Dawid Weiss commented on LUCENE-3003:
-

For what it's worth, the instrumentation interface allows one to get exact 
allocation sizes of objects. I put together a small spike at 
https://github.com/dweiss/poligon/tree/master/instrumenter that measures the 
actual allocation size of byte[]. On my hotspot, 64-bit, this yields:

{noformat}
byte[0] takes 24 bytes.
byte[1] takes 32 bytes.
byte[2] takes 32 bytes.
byte[3] takes 32 bytes.
byte[4] takes 32 bytes.
byte[5] takes 32 bytes.
byte[6] takes 32 bytes.
byte[7] takes 32 bytes.
byte[8] takes 32 bytes.
byte[9] takes 40 bytes.
byte[10] takes 40 bytes.
byte[11] takes 40 bytes.
...
{noformat}

IBM's VM yields the same (64-bit), but the version of jrockit that I have 
(which may be an old one, but is 64-bit!) yields:

{noformat}
byte[0] takes 16 bytes.
byte[1] takes 24 bytes.
byte[2] takes 24 bytes.
byte[3] takes 24 bytes.
byte[4] takes 24 bytes.
byte[5] takes 24 bytes.
byte[6] takes 24 bytes.
byte[7] takes 24 bytes.
byte[8] takes 24 bytes.
byte[9] takes 32 bytes.
byte[10] takes 32 bytes.
byte[11] takes 32 bytes.
byte[12] takes 32 bytes.
byte[13] takes 32 bytes.
byte[14] takes 32 bytes.
byte[15] takes 32 bytes.
byte[16] takes 32 bytes.
byte[17] takes 40 bytes.
{noformat}
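
Both listings are consistent with a simple model: a fixed per-array
overhead (the byte[0] size) plus the payload rounded up to the next
multiple of 8.  The sketch below is only an empirical fit to the
numbers listed here, not a JVM guarantee; the class ByteArraySizeModel
and its method names are hypothetical.

```java
/** Hypothetical model fitted to the measurements above; not a JVM guarantee. */
final class ByteArraySizeModel {

    /** Rounds n up to the next multiple of 8. */
    static int roundUp8(int n) {
        return (n + 7) & ~7;
    }

    /**
     * Estimated bytes taken by a byte[length], given the size measured for byte[0]
     * (24 in the first listing, 16 in the JRockit listing).
     */
    static int estimate(int emptyArrayBytes, int length) {
        return emptyArrayBytes + roundUp8(length);
    }
}
```

For example, estimate(24, 9) gives 40 and estimate(16, 17) gives 40,
matching the listings.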

I don't have access to a 32-bit system right now, but if you're keen on checking, 
check out that GitHub repo and run:

{noformat}
cd instrumenter
mvn package
java -javaagent:target/instrumenter-0.1.0-SNAPSHOT.jar -version
{noformat}




[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-31 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013986#comment-13013986
 ] 

Yonik Seeley commented on LUCENE-3003:
--

Thanks Dawid, this suggests that we could round up to the 8 byte boundary for 
free.





[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013731#comment-13013731
 ] 

Yonik Seeley commented on LUCENE-3003:
--

bq. I'd guess that pulsing should be 'good enough' most of the time?

This already really pulses I think? If the bytes can fit in an int, they are 
inlined right in the pointer that would normally point out to the byte array.
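
A simplified sketch of that inlining trick, for illustration only.
The exact bit layout here is an assumption, not a copy of the
UnInvertedField code: ords are delta-vInt encoded with each delta
shifted by 2, so a first byte of 0 (empty) or 1 (pointer tag) never
occurs in inline data.  The class InlineOrdsSketch and its overflow
list are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one int per doc holds either the encoded ords or a tagged pointer. */
final class InlineOrdsSketch {
    private static final int OFFSET = 2;   // keeps byte values 0 (terminator) and 1 (pointer tag) free
    private final List<byte[]> overflow = new ArrayList<>();

    /** Encodes ascending, non-negative ords as delta vInts, each delta shifted by OFFSET. */
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int ord : sortedOrds) {
            int v = ord - last + OFFSET;
            last = ord;
            while ((v & ~0x7f) != 0) {
                out.write((v & 0x7f) | 0x80);   // low 7 bits, continuation bit set
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    /** Returns the int slot for one doc: the bytes themselves if they fit, else a tagged pointer. */
    int slotFor(int[] sortedOrds) {
        byte[] bytes = encode(sortedOrds);
        if (bytes.length <= 4) {
            int slot = 0;
            for (int i = 0; i < bytes.length; i++) {
                slot |= (bytes[i] & 0xff) << (8 * i);   // first byte lands in the low 8 bits
            }
            return slot;
        }
        overflow.add(bytes);
        return ((overflow.size() - 1) << 8) | 1;        // low byte 1 marks a pointer
    }

    int[] decode(int slot) {
        byte[] bytes;
        if ((slot & 0xff) == 1) {
            bytes = overflow.get(slot >>> 8);
        } else {
            bytes = new byte[] {(byte) slot, (byte) (slot >>> 8), (byte) (slot >>> 16), (byte) (slot >>> 24)};
        }
        List<Integer> ords = new ArrayList<>();
        int last = 0;
        int pos = 0;
        while (pos < bytes.length && bytes[pos] != 0) {  // a zero byte ends short inline data
            int v = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                v |= (b & 0x7f) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            last += v - OFFSET;
            ords.add(last);
        }
        int[] result = new int[ords.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ords.get(i);
        }
        return result;
    }
}
```

Docs with only a few small ord deltas never touch a separate byte
array, which is why this suits fields with few values per document.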

bq. But for this case (faceting), usually, you have relatively few terms
and many docs per term?

We see everything.  But this structure was more optimized for a high number of 
unique terms, but relatively few per document.  This will perform well on a 
multi-valued author field, but relatively poorly on a large full-text field.







[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013747#comment-13013747
 ] 

Yonik Seeley commented on LUCENE-3003:
--

bq. The first-pass that allocates lots of tiny byte[] looks like it could be 
inefficient. Maybe we could use the byte slices from the indexer for this...

It is inefficient - but I never saw a way around it since the lists are all 
being built in parallel (due to the fact that we are uninverting).


Another small & easy optimization I hadn't gotten around to yet was to lower 
the indexIntervalBits and make it configurable.  Another small optimization 
would be to store an array of offsets to length-prefixed byte arrays, rather 
than a BytesRef[].  At least the values are already in packed byte arrays via 
PagedBytes.


I'd also love to hear others' thoughts on this memory optimization for many 
small byte arrays:
{code}
  // We avoid a doubling strategy to lower memory usage.
  // this faceting method isn't for docs with many terms.
  // In hotspot, objects have 2 words of overhead, then fields, rounded up to a 64-bit boundary.
  // TODO: figure out what array lengths we can round up to w/o actually using more memory
  // (how much space does a byte[] take up?  Is data preceded by a 32 bit length only?)
  // It should be safe to round up to the nearest 32 bits in any case.
  int newLen = (newend + 3) & 0xfffffffc;  // 4 byte alignment
{code}
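
To show the rounding in isolation, here is a small, hypothetical
sketch (the class AlignedGrowSketch and its method names are not from
the patch).  For a power-of-two alignment, masking with -alignment is
the same as the 0xfffffffc mask for 4, and the same helper covers an
8 byte boundary.

```java
import java.util.Arrays;

/** Hypothetical sketch: grow a byte[] to an aligned length instead of doubling it. */
final class AlignedGrowSketch {

    /** Rounds n up to a multiple of alignment, which must be a power of two. */
    static int alignUp(int n, int alignment) {
        return (n + alignment - 1) & -alignment;   // -4 == 0xfffffffc, -8 == 0xfffffff8
    }

    /** Returns arr if it already holds newEnd bytes, else a copy with an aligned length. */
    static byte[] ensureCapacity(byte[] arr, int newEnd, int alignment) {
        if (newEnd <= arr.length) {
            return arr;
        }
        return Arrays.copyOf(arr, alignUp(newEnd, alignment));
    }
}
```

Whether the rounded-up bytes are really free depends on how the JVM
aligns arrays, which is the open question in the TODO above.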




[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012659#comment-13012659
 ] 

Jason Rutherglen commented on LUCENE-3003:
--

{quote}Eventually we should fold this ability into docvalues, ie we'd
write the byte[] image at indexing time, and then loading would be
fast, instead of uninverting{quote}

I'd guess that pulsing should be 'good enough' most of the time?  It seems like 
there'll be some overlap in terms of the gains from pulsing vis-à-vis 
DocValues?  




[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-29 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012683#comment-13012683
 ] 

Michael McCandless commented on LUCENE-3003:


bq. I'd guess that pulsing should be 'good enough' most of the time? It seems 
like there'll be some overlap in terms of the gains from pulsing vis-à-vis 
DocValues?

I think Pulsing codec probably doesn't help much here?

Ie Pulsing is good for terms that have only 1 or 2 docs.

But for this case (faceting), usually, you have relatively few terms
and many docs per term?





[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012687#comment-13012687
 ] 

Jason Rutherglen commented on LUCENE-3003:
--

bq. Ie Pulsing is good for terms that have only 1 or 2 docs

I thought the default is 16 docs?  If there are more, then seeking to the 
postings should be negligible (in comparison to a larger aggregate index size 
when using CSF/DocValues, which will consume more of the system IO cache)?




[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

2011-03-29 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012810#comment-13012810
 ] 

Chris Male commented on LUCENE-3003:


+1 to committing this change then tackling the improvements separately.
