Re: indexing help

John Wang Thu, 08 Jul 2004 07:42:45 -0700

Hi Grant:
     Thanks for the options. How likely is it that the Lucene file formats
will change?


     Are there really no more options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi John,
> 
> The source code is available from CVS; make it non-final and do what you need to do.
> Of course, you may have a hard time finding help later if you aren't using
> what everyone else is using and your solution doesn't work...  :-)
> 
> If I understand correctly what you are trying to do, you already know all of the
> answers for indexing and you just want Lucene to handle the retrieval side of the
> coin, correct?  I suppose a crazy idea might be to write a program that takes your
> info and writes it out in the Lucene file format, but that seems a bit like overkill.
> 
> -Grant
> 
> >>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
> 
> 
> Hi Doug:
>     Thanks for the response!
> 
>     The solution you proposed is still a derivative of creating a
> dummy document stream. Taking the same example, java (5), lucene (6),
> VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> necessary.
> 
>    Given many documents with many terms and frequencies, it would
> create many extra Token instances.
> 
>   The reason I was looking into deriving from the Field class is that I
> could directly manipulate the FieldInfo by setting the frequency. But
> the class is final...
> 
>   Any other suggestions?
> 
> Thanks
> 
> -John
> 
> On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > John Wang wrote:
> > >      When Lucene tokenizes the words in a document, it counts the
> > > frequency and figures out the position. We are trying to bypass this
> > > stage: for each document, I have a set of words with a known frequency,
> > > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > > can always be 0.)
> > >
> > >      What I can do now is to create a dummy document, e.g. "java java
> > > java java java lucene lucene lucene lucene lucene lucene" and pass it to
> > > Lucene.
> > >
> > >      This seems hacky and cumbersome. Is there a better alternative? I
> > > browsed around in the source code, but couldn't find anything.
> >
> > Write an analyzer that returns terms with the appropriate distribution.
> >
> > For example:
> >
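> > import org.apache.lucene.analysis.Token;
> > import org.apache.lucene.analysis.TokenStream;
> >
> > public class VectorTokenStream extends TokenStream {
> >   private final String[] terms;
> >   private final int[] freqs;
> >   private int term = -1;  // index of the current term; -1 until the first call
> >   private int freq = 0;   // tokens still to emit for the current term
> >   public VectorTokenStream(String[] terms, int[] freqs) {
> >     this.terms = terms;
> >     this.freqs = freqs;
> >   }
> >   public Token next() {
> >     // Advance to the next term that has a nonzero frequency.
> >     while (freq == 0) {
> >       term++;
> >       if (term >= terms.length)
> >         return null;  // end of stream
> >       freq = freqs[term];
> >     }
> >     freq--;
> >     return new Token(terms[term], 0, 0);  // offsets are not used here
> >   }
> > }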
> >
> > Document doc = new Document();
> > doc.add(Field.Text("content", ""));
> > indexWriter.addDocument(doc, new Analyzer() {
> >   public TokenStream tokenStream(String field, Reader reader) {
> >     return new VectorTokenStream(new String[] {"java","lucene"},
> >                                  new int[] {5,6});
> >   }
> > });
> >
> > >       Too bad the Field class is final, otherwise I could derive from it
> > > and do something along those lines...
> >
> > Extending Field would not help.  That's why it's final.
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 
>
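
For reference, the expansion discussed in this thread can be sketched in
plain Java, without any Lucene classes. This is only an illustration of the
idea, not code from the thread: the class name TermFreqExpansion and the
expand() helper are made up for the example, and an Iterator<String> stands
in for the TokenStream. It shows why java (5), lucene (6) ends up as 11
tokens.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the token stream idea: emits each term
// as many times as its frequency says. Not part of Lucene.
public class TermFreqExpansion implements Iterator<String> {
    private final String[] terms;
    private final int[] freqs;
    private int term = -1;      // index of the current term; -1 before the first one
    private int remaining = 0;  // copies of the current term still to emit

    public TermFreqExpansion(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        this.terms = terms;
        this.freqs = freqs;
    }

    // Move forward until a term with copies left is found or the input is exhausted.
    private void advance() {
        while (remaining <= 0 && term + 1 < terms.length) {
            term++;
            remaining = freqs[term];
        }
    }

    @Override
    public boolean hasNext() {
        advance();
        return remaining > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return terms[term];
    }

    // Convenience helper: collect the whole expansion into a list.
    public static List<String> expand(String[] terms, int[] freqs) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new TermFreqExpansion(terms, freqs);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = expand(new String[] {"java", "lucene"}, new int[] {5, 6});
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

With the example from the thread, expand() returns a list of 11 entries
(5 x "java" followed by 6 x "lucene"), which is the per-occurrence cost
John is objecting to. Terms with a frequency of 0 are skipped.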
