Hi Grant: Thanks for the options. How likely will the lucene file formats change?
Are there really no more optiosn? :(... Thanks -John On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Hi John, > > The source code is available from CVS, make it non-final and do what you need to do. > Of course, you may have a hard time finding help later if you aren't using > something everyone else is and your solution doesn't work... :-) > > If I understand correctly what you are trying to do, you already know all of the > answers for indexing, you just want Lucene to do the retrieval side of the coin, > correct? I suppose a crazy idea might be to write a program that took your info and > output it in the Lucene file format, but that seems a bit like overkill. > > -Grant > > >>> [EMAIL PROTECTED] 07/07/04 07:37PM >>> > > > Hi Doug: > Thanks for the response! > > The solution you proposed is still a derivative of creating a > dummy document stream. Taking the same example, java (5), lucene (6), > VectorTokenStream would create a total of 11 Tokens whereas only 2 is > neccessary. > > Given many documents with many terms and frequencies, it would > create many extra Token instances. > > The reason I was looking to derving the Field class is because I > can directly manipulate the FieldInfo by setting the frequency. But > the class is final... > > Any other suggestions? > > Thanks > > -John > > On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote: > > John Wang wrote: > > > While lucene tokenizes the words in the document, it counts the > > > frequency and figures out the position, we are trying to bypass this > > > stage: For each document, I have a set of words with a know frequency, > > > e.g. java (5), lucene (6) etc. (I don't care about the position, so it > > > can always be 0.) > > > > > > What I can do now is to create a dummy document, e.g. "java java > > > java java java lucene lucene lucene lucene lucene" and pass it to > > > lucene. > > > > > > This seems hacky and cumbersome. Is there a better alternative? I > > > browsed around in the source code, but couldn't find anything. > > > > Write an analyzer that returns terms with the appropriate distribution. > > > > For example: > > > > public class VectorTokenStream extends TokenStream { > > private int term; > > private int freq; > > public VectorTokenStream(String[] terms, int[] freqs) { > > this.terms = terms; > > this.freqs = freqs; > > } > > public Token next() { > > if (freq == 0) { > > term++; > > if (term >= terms.length) > > return null; > > freq = freqs[term]; > > } > > freq--; > > return new Token(terms[term], 0, 0); > > } > > } > > > > Document doc = new Document(); > > doc.add(Field.Text("content", "")); > > indexWriter.addDocument(doc, new Analyzer() { > > public TokenStream tokenStream(String field, Reader reader) { > > return new VectorTokenStream(new String[] {"java","lucene"}, > > new int[] {5,6}); > > } > > }); > > > > > Too bad the Field class is final, otherwise I can derive from it > > > and do something on that line... > > > > Extending Field would not help. That's why it's final. > > > > Doug > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]