A different approach would be to just allow binary data in fields. That way applications can compress and decompress as they see fit, plus they would be able to store numerical and other data more efficiently.
That's an interesting idea. One could, for convenience and compatibility, add accessor methods to Field that, when you add a String, convert it to UTF-8 bytes, and make stringValue() parse (and possibly cache) a UTF-8 string from the binary value. There'd be another allocation per field read: FieldReader would construct a byte[], then stringValue() would construct a String with a char[]. Right now we only construct a String with a char[] per stringValue(). Perhaps this is moot, especially if we're lazy about constructing the strings and they're cached. That way, for all the fields you don't access you save an allocation.
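A lazy, cached decode could look roughly like this (a sketch only; `LazyStringField` and its method names are hypothetical, not the actual Field API):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a field value whose canonical form is UTF-8 bytes,
// decoding (and caching) the String only when stringValue() is first called.
class LazyStringField {
    private final byte[] bytes;  // canonical binary value
    private String cached;       // decoded lazily, at most once

    LazyStringField(String text) {
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    LazyStringField(byte[] bytes) {
        this.bytes = bytes;
    }

    byte[] binaryValue() {
        return bytes;
    }

    String stringValue() {
        if (cached == null) {
            cached = new String(bytes, StandardCharsets.UTF_8);
        }
        return cached;  // fields never read as strings never pay for decoding
    }
}
```

Fields the application never reads as strings thus skip both the decode and the extra allocation.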
Then you could also add intValue() and floatValue() methods, etc. which use binary representations. These could speed up lots of stuff.
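For instance, a fixed-width big-endian encoding via `java.nio.ByteBuffer` would let such accessors avoid string parsing entirely (a sketch under assumed names, not a proposed API):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: 4-byte binary encodings that intValue() and
// floatValue() accessors could use instead of parsing decimal strings.
class NumericFieldValue {
    static byte[] intToBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    static int bytesToInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt();
    }

    static byte[] floatToBytes(float v) {
        return ByteBuffer.allocate(4).putFloat(v).array();
    }

    static float bytesToFloat(byte[] b) {
        return ByteBuffer.wrap(b).getFloat();
    }
}
```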
For easy extensibility you could do something like:
interface FieldValue {
  byte[] getBytes();
  void setValue(byte[] value);
}

/** Extracts the value of the field into <code>value</code>.
 * @see FieldValue#setValue(byte[])
 */
void getValue(FieldValue value) {
  value.setValue(getBytes());
}

// replace the base Field ctor with:
public Field(String name, FieldValue value,
             boolean store, boolean index,
             boolean token, boolean vector) {
  ...
  bytes = value.getBytes();
  ...
}

public class CompressedTextFieldValue implements FieldValue {
  public CompressedTextFieldValue(String text) { ... }
  public String toString() { ... }
  ...
}

public class SerializableFieldValue implements FieldValue {
  public SerializableFieldValue(Serializable value) { ... }
  public Serializable getSerializable() { ... }
  ...
}

It could be up to the application to always use the same FieldValue class with a field, or we could add the FieldValue class to the index's FieldInfos...
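To make the idea concrete, here is a minimal, self-contained illustration of the interface round-tripping plain text through UTF-8 bytes (`StringFieldValue` is a hypothetical name, not part of the sketch above):

```java
import java.nio.charset.StandardCharsets;

interface FieldValue {
    byte[] getBytes();
    void setValue(byte[] bytes);
}

// Simplest possible implementation: a text value stored as UTF-8 bytes.
class StringFieldValue implements FieldValue {
    private byte[] bytes;

    StringFieldValue(String text) {
        bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    StringFieldValue() { }  // empty; to be filled via setValue()

    public byte[] getBytes() { return bytes; }

    public void setValue(byte[] bytes) { this.bytes = bytes; }

    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

An application would hand a `StringFieldValue` to the Field constructor when indexing, and pass an empty one to `getValue()` when reading.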
I'd like to continue to be able to avoid storing type information per field instance, and to avoid re-inventing object serialization, but maybe I need to give these up...
Of course, this would then be a per-value compression and probably not as effective as a whole index compression that could be done with the other approaches.
But, since documents are accessed randomly, we can't easily do a lot better for field data.
Doug, what compression algorithm did you have in mind for the actual compression?
I was just thinking gzip. Alternately, one could make it extensible, and tag each item with the compression algorithm, but I think that gets to be a mess. Also, it's good to stick to a standard algorithm, so that perl, c#, C++, etc. ports can easily incorporate the feature.
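With the standard `java.util.zip` classes, that per-value gzip round trip is short (a sketch of the idea, not the actual patch; `GzipUtil` is an illustrative name):

```java
import java.io.*;
import java.util.zip.*;

class GzipUtil {
    // Compress a byte[] into the standard gzip format.
    static byte[] compress(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {  // cannot occur with in-memory streams
            throw new UncheckedIOException(e);
        }
    }

    // Inflate gzip-compressed bytes back to the original data.
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since gzip is specified by RFC 1952, the perl, C#, and C++ ports mentioned above can all read and write the same compressed values with their own standard libraries.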
This feature is primarily intended to make life easier for folks who want to store whole documents in the index. Selective use of gzip would be a huge improvement over the present situation. Alternate compression algorithms might make things a bit better yet, but probably not hugely.
Doug
