Re: Dmitry's Term Vector stuff, plus some

Doug Cutting Tue, 17 Feb 2004 12:40:14 -0800

Grant Ingersoll wrote:

I agree with your assessment about getting it right the first time. I can make the changes, as I don't think they are that involved, and it will benefit me and my employer in the long run if the changes are committed, since we won't have to reapply the patches every time there is a new release.

Great! Thanks.

It would really speed things up if you can point me to examples of writing the version number (and the logic for ignoring something of the wrong version) and the compressed format.

The new TermInfosWriter code writes FORMAT, the current version number. This is read by SegmentTermEnum. This is not a great example, since the previous file format didn't support a version number. I added it by using negative numbers for the version number so that it can be distinguished from any valid value at the start of the old format. It will be easier in your case, since back-compatibility is not yet an issue.
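To make the negative-number trick concrete, here is a minimal sketch, not the actual TermInfosWriter/SegmentTermEnum code. It assumes the legacy file starts with a non-negative int (an entry count), so any negative first int can only be a version marker. The class and method names, the value -1, and the use of java.io data streams are all hypothetical, chosen only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class NegativeFormatSketch {
    // Hypothetical value: any negative number works as a version marker,
    // and later formats would count further down (-2, -3, ...).
    static final int FORMAT = -1;

    /** What the reader learned from the start of the file. */
    static final class Header {
        final int format;  // 0 means a legacy file with no version field
        final long size;   // number of entries
        Header(int format, long size) { this.format = format; this.size = size; }
    }

    // Assumed legacy layout: the file begins directly with a non-negative count.
    static void writeLegacyHeader(DataOutputStream out, int size) throws IOException {
        out.writeInt(size);
    }

    // New layout: negative version marker first, then the count.
    static void writeHeader(DataOutputStream out, long size) throws IOException {
        out.writeInt(FORMAT);
        out.writeLong(size);
    }

    static Header readHeader(DataInputStream in) throws IOException {
        int first = in.readInt();
        if (first >= 0) {
            // No version marker: the first int is the legacy count itself.
            return new Header(0, first);
        }
        if (first < FORMAT) {
            // More negative than anything this reader knows: written by newer code.
            throw new IOException("Unknown format: " + first);
        }
        return new Header(first, in.readLong());
    }
}
```

Because versions count downward here, the "too new" check is `first < FORMAT` rather than `>`.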

In general, the idea is to store a file format version as the first four bytes of each file, e.g., something like:

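class MyWriter {
  public static final int FORMAT = 1;

  // OutputStream/InputStream are assumed to be Lucene's store-level
  // stream classes, which provide writeInt()/readInt();
  // java.io.OutputStream does not.
  public void write(OutputStream out) throws IOException {
    out.writeInt(FORMAT);   // the version goes first in the file
    ....
  }
}

class MyReader {
  public void read(InputStream in) throws IOException {
    int format = in.readInt();

    // written by newer code than this reader understands
    if (format > MyWriter.FORMAT) {
      throw new IOException("Unknown format: " + format);
    }

    if (format == 0) {
       ... back-compatibility stuff for format 0
    } else {
      ...  stuff for current version
    }
  }
}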

As for prefix compression of strings, check out TermInfosWriter#writeTerm() and SegmentTermEnum#readTerm(). Since vectors contain lexicographically sorted lists of terms in the same field, you can use the same technique.
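For reference, the general shape of prefix compression over a sorted term list can be sketched as below. This is an illustration only, not the code in TermInfosWriter or SegmentTermEnum: the class and method names are made up, and it uses java.io data streams with fixed-width ints and writeUTF for simplicity, which is an assumption about encoding rather than Lucene's on-disk layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class PrefixCompressionSketch {

    // Number of leading chars that a and b have in common.
    static int sharedPrefixLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // For each term, store only how much of the previous term it reuses,
    // followed by the remaining suffix.
    static byte[] writeTerms(List<String> sortedTerms) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(sortedTerms.size());
        String last = "";
        for (String term : sortedTerms) {
            int prefix = sharedPrefixLength(last, term);
            out.writeInt(prefix);
            out.writeUTF(term.substring(prefix));
            last = term;
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Rebuild each term from the previous term's prefix plus the stored suffix.
    static List<String> readTerms(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readInt();
        List<String> terms = new ArrayList<>(count);
        String last = "";
        for (int i = 0; i < count; i++) {
            int prefix = in.readInt();
            String suffix = in.readUTF();
            last = last.substring(0, prefix) + suffix;
            terms.add(last);
        }
        return terms;
    }
}
```

With the input "apple", "applet", "apply", the second term is stored as a prefix length of 5 plus the suffix "t". The scheme pays off precisely because sorted terms tend to share long prefixes with their neighbours.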

Hope that helps,

Doug

