Title: indexing problems

Hello all,

This is a fairly complex problem, so I will try to explain it briefly first.

The basic problem is that IndexWriter (or actually DocumentWriter) cannot be called before I have set all fields in HTMLDocument. By setting fields I mean Document.addField(new Field(....)). Since fields can only hold Reader or String objects, things get complicated. The basic scenario is that I start to parse an HTML document. I get a PipedReader for that file, which delivers all of its text content. I would also like to scan all meta tags, the title, some comments etc. However, I am now required to _wait_ until they are available and set them with Document.addField() before I can pass the Document to IndexWriter.

Why is this bad? It is bad because PipedReader has a limited buffer (1024 characters). If the title, meta etc. information is not found within the first 1024 characters, the pipe fills up, the parser blocks on its next write, and we cannot parse any further, because nothing reads from the pipe until the Document has been handed to IndexWriter. Please note that we do not want to read the entire file into memory first and then index it (what a waste of time and memory!). It is also annoying that only Reader and String objects are accepted. As we all know, String is final and immutable. Therefore you cannot pass a String to some object and change its contents later, for example when I finally read the title information from the stream.
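The blocking behaviour is easy to reproduce outside Lucene. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes the default pipe size of java.io.PipedReader (1024 characters). A producer thread writes into the pipe while nobody reads; after a short wait we check how many characters it managed to write before it stalled.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.util.concurrent.atomic.AtomicInteger;

public class PipeBlockDemo {

    /**
     * Writes up to 'total' characters into a pipe that nobody reads yet and
     * returns how many had been written after 'waitMillis' milliseconds.
     * (Hypothetical helper, for illustration only.)
     */
    static int charsWrittenBeforeBlocking(final int total, long waitMillis)
            throws IOException, InterruptedException {
        final PipedReader reader = new PipedReader();   // default pipe size
        final PipedWriter writer = new PipedWriter(reader);
        final AtomicInteger written = new AtomicInteger();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    writer.write('x');                  // blocks once the pipe is full
                    written.incrementAndGet();
                }
            } catch (IOException ignored) {
                // nothing useful to do in this sketch
            } finally {
                try { writer.close(); } catch (IOException ignored) { }
            }
        });
        producer.setDaemon(true);
        producer.start();

        Thread.sleep(waitMillis);                       // give the producer time to stall
        int stalledAt = written.get();

        // Drain the pipe so the producer can finish and close its end.
        while (reader.read() != -1) { /* discard */ }
        producer.join();
        return stalledAt;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("written before blocking: "
                + charsWrittenBeforeBlocking(5000, 500));
    }
}
```

In the indexing scenario the parser is the producer and the indexer is the (not yet started) consumer, so any title or meta tag located after the point where the pipe fills up is never reached.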

How to fix this?

1. Alter com.lucene.document.Field to accept a StringBuffer as the second argument in all methods.
2. Alter com.lucene.index.DocumentWriter:

  Reader reader;                          // find or make Reader
  if (field.readerValue() != null)
        reader = field.readerValue();
  else if (field.stringValue() != null)
        reader = new StringReader(field.stringValue());
// ---- begin modifications ----
  else if (field.stringBufferValue() != null)   // proposed new accessor on Field
        reader = new StringReader(field.stringBufferValue().toString());
// ---- end modifications ----
  else
        throw new IllegalArgumentException("field must have either String, StringBuffer or Reader value");

Since we can now pass a StringBuffer (which is final but mutable) as an argument to Field, we can add all fields to the Document with addField right away and pass the Document to IndexWriter. The trick is that the first field the indexer processes is the body field, which holds the Reader. This is the one that takes the longest time. While the body is being processed, the parser also picks up the title, meta etc. information and pushes that data into the StringBuffers we gave as arguments to the Field constructors. By the time the indexer reaches those fields, their StringBuffers are filled.
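To make the idea concrete without touching Lucene, here is a rough, self-contained sketch. Everything in it is hypothetical: TitleCapturingReader stands in for the HTML parser, consumeBodyThenTitle stands in for DocumentWriter processing the body field first, and the simple "<title>" matcher is not a real HTML parser. It only shows that a StringBuffer handed over empty can be filled while the body Reader is being consumed.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class DeferredFieldSketch {

    /** Hypothetical reader that copies the <title> text into a shared StringBuffer as the body streams through. */
    static class TitleCapturingReader extends FilterReader {
        private static final String OPEN = "<title>";
        private final StringBuffer title;
        private int matched = 0;   // number of characters of OPEN matched so far
        private int state = 0;     // 0 = searching, 1 = capturing, 2 = done

        TitleCapturingReader(Reader in, StringBuffer title) {
            super(in);
            this.title = title;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1) scan((char) c);
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            for (int i = 0; i < n; i++) scan(buf[off + i]);
            return n;
        }

        private void scan(char c) {
            if (state == 1) {
                if (c == '<') state = 2; else title.append(c);
            } else if (state == 0) {
                if (Character.toLowerCase(c) == OPEN.charAt(matched)) {
                    if (++matched == OPEN.length()) { state = 1; matched = 0; }
                } else {
                    matched = (c == '<') ? 1 : 0;
                }
            }
        }
    }

    /** Stand-in for the indexer: consumes the body field first, then reads the StringBuffer field. */
    static String consumeBodyThenTitle(Reader body, StringBuffer title) throws IOException {
        char[] buf = new char[256];
        while (body.read(buf, 0, buf.length) != -1) {
            // a real indexer would tokenize the body here
        }
        // Equivalent of new StringReader(field.stringBufferValue().toString())
        return title.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><TITLE>Indexing problems</TITLE></head>"
                + "<body>Some text</body></html>";
        StringBuffer title = new StringBuffer();          // empty when the "field" is created
        Reader body = new TitleCapturingReader(new StringReader(html), title);
        System.out.println(consumeBodyThenTitle(body, title));
    }
}
```

Note that the approach depends on field order: if the StringBuffer field were read before the body Reader had been consumed, it would still be empty.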

What do you think about this solution, and are there any other ways around the problem? I would rather not edit the package and maintain "my versions" of it, since that makes future updates much more complicated.

PS. I am aware that the demo package contains an HTMLParser with an ugly kludge for getting around this problem. It is unsuitable for real-life scenarios for various reasons and makes the parsing process unstable.

Best regards,

Jarkko Viinamäki
Solution Architect
RTSe Finland Oy

mailto:[EMAIL PROTECTED]
