Hey John,

Those are just options, didn't say they were good ones!  :-)

I guess the real question is, what is the background of what you are trying to do?  
Presumably you have some other program that is generating frequencies for you. Do you 
really need that in its current form?  Can't the Lucene indexing engine act as a 
stand-in for this process, since your end result _should_ be the same?  The Lucene 
Analyzer process is quite flexible; I bet you could even find a way to hook your 
existing tools into the Analyzer process.
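
As a rough illustration of hooking precomputed frequencies into a token-producing step, here is a minimal sketch. It is plain Java with no Lucene dependency, so FrequencyExpander and expand() are hypothetical names, and java.util.Iterator merely stands in for a TokenStream; treat it as a sketch under those assumptions, not as Lucene's API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a token stream: emits terms[i] exactly freqs[i] times. */
final class FrequencyExpander implements Iterator<String> {
    private final String[] terms;
    private final int[] freqs;
    private int term = 0;    // index of the term currently being emitted
    private int remaining;   // copies of terms[term] still to emit

    FrequencyExpander(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        this.terms = terms;
        this.freqs = freqs;
        this.remaining = terms.length > 0 ? freqs[0] : 0;
    }

    @Override
    public boolean hasNext() {
        // Move past terms that are used up or have a zero (or negative) frequency.
        while (term < terms.length && remaining <= 0) {
            term++;
            if (term < terms.length) {
                remaining = freqs[term];
            }
        }
        return term < terms.length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return terms[term];
    }

    /** Drains an expander into a list, mainly for inspection. */
    static List<String> expand(String[] terms, int[] freqs) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new FrequencyExpander(terms, freqs);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = expand(new String[] {"java", "lucene"}, new int[] {5, 6});
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

For the java (5), lucene (6) example this yields 11 tokens in total, generated one at a time rather than from a concatenated dummy string.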

-Grant

>>> [EMAIL PROTECTED] 07/08/04 10:42AM >>>
Hi Grant:
     Thanks for the options. How likely is it that the Lucene file formats will change?

     Are there really no more options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi John,
> 
> The source code is available from CVS; make it non-final and do what you need to do. 
> Of course, you may have a hard time finding help later if you aren't using 
> something everyone else is and your solution doesn't work...  :-)
> 
> If I understand correctly what you are trying to do, you already know all of the 
> answers for indexing, you just want Lucene to do the retrieval side of the coin, 
> correct?  I suppose a crazy idea might be to write a program that took your info and 
> output it in the Lucene file format, but that seems a bit like overkill.
> 
> -Grant
> 
> >>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
> 
> 
> Hi Doug:
>     Thanks for the response!
> 
>     The solution you proposed is still a derivative of creating a
> dummy document stream. Taking the same example, java (5), lucene (6),
> VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> necessary.
> 
>    Given many documents with many terms and frequencies, it would
> create many extra Token instances.
> 
>   The reason I was looking into deriving from the Field class is that I
> could directly manipulate the FieldInfo by setting the frequency. But
> the class is final...
> 
>   Any other suggestions?
> 
> Thanks
> 
> -John
> 
> On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > John Wang wrote:
> > >      While lucene tokenizes the words in the document, it counts the
> > > frequency and figures out the position. We are trying to bypass this
> > > stage: for each document, I have a set of words with a known frequency,
> > > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > > can always be 0.)
> > >
> > >      What I can do now is to create a dummy document, e.g. "java java
> > > java java java lucene lucene lucene lucene lucene lucene" and pass it
> > > to lucene.
> > >
> > >      This seems hacky and cumbersome. Is there a better alternative? I
> > > browsed around in the source code, but couldn't find anything.
> >
> > Write an analyzer that returns terms with the appropriate distribution.
> >
> > For example:
> >
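> > import org.apache.lucene.analysis.Token;
> > import org.apache.lucene.analysis.TokenStream;
> >
> > public class VectorTokenStream extends TokenStream {
> >   private String[] terms;
> >   private int[] freqs;
> >   private int term = -1;  // index of the current term; -1 before the first
> >   private int freq = 0;   // copies of the current term still to emit
> >   public VectorTokenStream(String[] terms, int[] freqs) {
> >     this.terms = terms;
> >     this.freqs = freqs;
> >   }
> >   public Token next() {
> >     while (freq == 0) {     // advance, skipping zero-frequency terms
> >       term++;
> >       if (term >= terms.length)
> >         return null;        // end of stream
> >       freq = freqs[term];
> >     }
> >     freq--;
> >     return new Token(terms[term], 0, 0);
> >   }
> > }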
> >
> > Document doc = new Document();
> > doc.add(Field.Text("content", ""));
> > indexWriter.addDocument(doc, new Analyzer() {
> >   public TokenStream tokenStream(String field, Reader reader) {
> >     return new VectorTokenStream(new String[] {"java","lucene"},
> >                                  new int[] {5,6});
> >   }
> > });
> >
> > >       Too bad the Field class is final, otherwise I could derive from it
> > > and do something along those lines...
> >
> > Extending Field would not help.  That's why it's final.
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED] 
> > For additional commands, e-mail: [EMAIL PROTECTED] 