Re: indexing help

John Wang Thu, 08 Jul 2004 08:19:51 -0700

Hi Grant:

     I have something that extracts only the important words from
a document, along with their importance. Furthermore, these important
words may not physically appear in the document; they could be synonyms
of some of the words in the document. So the output of the process for a
document is a list of word/importance pairs.


    I want to be able to query the document using only these words.

   I don't think Lucene has such a capability.

   Can you suggest how I could use the Analyzer process to do
this without replicating words/tokens?

Thanks

-John
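
For scale, the token repetition that the VectorTokenStream approach quoted
below relies on can be sketched in plain Java, with no Lucene dependency.
TermExpander and expand are hypothetical names used only for illustration;
the sketch simply repeats each term once per unit of frequency, so
java (5) and lucene (6) come out as 11 entries for 2 distinct terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: expands (term, frequency) pairs
// into the flat token sequence that a repeating token stream would emit.
class TermExpander {
    static List<String> expand(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // Emit the i-th term freqs[i] times; a zero count emits nothing.
            for (int n = 0; n < freqs[i]; n++) {
                tokens.add(terms[i]);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = expand(new String[] {"java", "lucene"}, new int[] {5, 6});
        System.out.println(tokens.size() + " tokens for 2 distinct terms");
    }
}
```

The list grows with the sum of the frequencies rather than with the number
of distinct terms, which is the overhead raised in the thread.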

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hey John,
> 
> Those are just options, didn't say they were good ones!  :-)
> 
> I guess the real question is, what is the background of what you are trying to do?  
> Presumably you have some other program that is generating frequencies for you, do 
> you really need that in the current form?  Can't the Lucene indexing engine act as a 
> stand-in for this process since your end result _should_ be the same?  The Lucene 
> Analyzer process is quite flexible; I bet you could even find a way to hook your
> existing tools into it.
> 
> -Grant
> 
> >>> [EMAIL PROTECTED] 07/08/04 10:42AM >>>
> 
> 
> Hi Grant:
>     Thanks for the options. How likely is it that the Lucene file formats will change?
> 
>     Are there really no more options? :(...
> 
> Thanks
> 
> -John
> 
> On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > Hi John,
> >
> > The source code is available from CVS; make it non-final and do what you need to
> > do.  Of course, you may have a hard time finding help later if you aren't using 
> > something everyone else is and your solution doesn't work...  :-)
> >
> > If I understand correctly what you are trying to do, you already know all of the 
> > answers for indexing, you just want Lucene to do the retrieval side of the coin, 
> > correct?  I suppose a crazy idea might be to write a program that took your info 
> > and output it in the Lucene file format, but that seems a bit like overkill.
> >
> > -Grant
> >
> > >>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
> >
> >
> > Hi Doug:
> >     Thanks for the response!
> >
> >     The solution you proposed is still a derivative of creating a
> > dummy document stream. Taking the same example, java (5), lucene (6),
> > VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> > necessary.
> >
> >    Given many documents with many terms and frequencies, it would
> > create many extra Token instances.
> >
> >   The reason I was looking at deriving from the Field class is that I
> > could directly manipulate the FieldInfo by setting the frequency. But
> > the class is final...
> >
> >   Any other suggestions?
> >
> > Thanks
> >
> > -John
> >
> > On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > > John Wang wrote:
> > > > >      While Lucene tokenizes the words in the document, it counts the
> > > > > frequencies and figures out the positions. We are trying to bypass this
> > > > > stage: for each document, I have a set of words with known frequencies,
> > > > > e.g. java (5), lucene (6), etc. (I don't care about the position, so it
> > > > > can always be 0.)
> > > >
> > > > >      What I can do now is to create a dummy document, e.g. "java java
> > > > > java java java lucene lucene lucene lucene lucene lucene" and pass it to
> > > > > Lucene.
> > > >
> > > >      This seems hacky and cumbersome. Is there a better alternative? I
> > > > browsed around in the source code, but couldn't find anything.
> > >
> > > Write an analyzer that returns terms with the appropriate distribution.
> > >
> > > For example:
> > >
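> > > > import org.apache.lucene.analysis.Token;
> > > > import org.apache.lucene.analysis.TokenStream;
> > > >
> > > > public class VectorTokenStream extends TokenStream {
> > > >   private final String[] terms;
> > > >   private final int[] freqs;
> > > >   private int term = -1;  // index of the current term; -1 before the first call
> > > >   private int freq = 0;   // occurrences of the current term still to emit
> > > >   public VectorTokenStream(String[] terms, int[] freqs) {
> > > >     this.terms = terms;
> > > >     this.freqs = freqs;
> > > >   }
> > > >   public Token next() {
> > > >     while (freq == 0) {   // advance to the next term, skipping zero counts
> > > >       term++;
> > > >       if (term >= terms.length)
> > > >         return null;      // end of stream
> > > >       freq = freqs[term];
> > > >     }
> > > >     freq--;
> > > >     return new Token(terms[term], 0, 0);
> > > >   }
> > > > }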
> > >
> > > Document doc = new Document();
> > > doc.add(Field.Text("content", ""));
> > > indexWriter.addDocument(doc, new Analyzer() {
> > >   public TokenStream tokenStream(String field, Reader reader) {
> > >     return new VectorTokenStream(new String[] {"java","lucene"},
> > >                                  new int[] {5,6});
> > >   }
> > > });
> > >
> > > > >       Too bad the Field class is final, otherwise I could derive from it
> > > > > and do something along those lines...
> > >
> > > Extending Field would not help.  That's why it's final.
> > >
> > > Doug
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> 
> 
>

