Re: Resolving term vector even when not stored?

2007-03-17 Thread karl wettin


On 17 Mar 2007, at 08:15, Doron Cohen wrote:


"Mike Klaas" <[EMAIL PROTECTED]> wrote on 16/03/2007 14:26:46:


On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:

I propose a change to the current IndexReader.getTermFreqVector/s
code so that it /always/ returns the vector space model of a document,
even when fields are set to Field.TermVector.NO.

Is that crazy? Could be really slow, but except for that.. And if it
is cached then that information is known by inspecting the fields.
People don't go fetching term vectors without knowing what they are
doing, do they?


The highlighting contrib code does this: attempt to retrieve the
termvector, catch InvalidArgumentException, fall back to re-analysis
of the data.


This way makes more sense to me.  IndexReader.getTermFreqVector()
means it's there, just bring it,


The way I look at it, the vector space model is there all the time and
Field.TermVector.YES really means Field.TermVector.Level1Cached.

Also, I would not mind a soft referenced map in IndexReader that keeps
track of all resolved term vectors. Perhaps that should be a decoration.
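Such a soft-referenced map could look roughly like the sketch below: plain Java, generic over the vector type, with a hypothetical Resolver hook standing in for the actual vector resolution (none of these names exist in Lucene; this is only an illustration of the caching decoration):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a soft-referenced cache keyed by document number. The GC may
 * clear entries under memory pressure, in which case the vector is simply
 * resolved again via the supplied resolver.
 */
class SoftTermVectorCache<V> {

  /** Hypothetical hook standing in for the real vector resolution. */
  interface Resolver<V> {
    V resolve(int doc);
  }

  private final Map<Integer, SoftReference<V>> cache =
      new HashMap<Integer, SoftReference<V>>();
  private final Resolver<V> resolver;

  SoftTermVectorCache(Resolver<V> resolver) {
    this.resolver = resolver;
  }

  synchronized V get(int doc) {
    SoftReference<V> ref = cache.get(doc);
    // ref.get() returns null if the GC has cleared the referent
    V vector = ref == null ? null : ref.get();
    if (vector == null) {
      vector = resolver.resolve(doc);
      cache.put(doc, new SoftReference<V>(vector));
    }
    return vector;
  }
}
```

Decorating rather than building the map into IndexReader keeps the caching policy out of the reader itself.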


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Resolving term vector even when not stored?

2007-03-16 Thread Doron Cohen
"Mike Klaas" <[EMAIL PROTECTED]> wrote on 16/03/2007 14:26:46:

> On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:
> > I propose a change to the current IndexReader.getTermFreqVector/s
> > code so that it /always/ returns the vector space model of a document,
> > even when fields are set to Field.TermVector.NO.
> >
> > Is that crazy? Could be really slow, but except for that.. And if it
> > is cached then that information is known by inspecting the fields.
> > People don't go fetching term vectors without knowing what they are
> > doing, do they?
>
> The highlighting contrib code does this: attempt to retrieve the
> termvector, catch InvalidArgumentException, fall back to re-analysis
> of the data.

This way makes more sense to me.  IndexReader.getTermFreqVector() means it's
there, just bring it, while the fall-back is more a
computeTermFreqVector(), which takes much more time.  Users would likely
prefer getting an exception from get() (oops, term vectors were not
saved..) rather than auto falling back to an expensive computation.

This functionality seems appropriate as a reusable utility, perhaps in
contrib?
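The get()/compute() split suggested here can be sketched without any Lucene dependencies (all names and the toy int[] vector representation are hypothetical, only illustrating the fail-fast contract):

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrates get() failing fast while compute() does the expensive work. */
class TermVectorAccess {

  private final Map<Integer, int[]> stored = new HashMap<Integer, int[]>();

  void store(int doc, int[] vector) {
    stored.put(doc, vector);
  }

  /** Fails fast when the vector was not stored at index time. */
  int[] getTermFreqVector(int doc) {
    int[] vector = stored.get(doc);
    if (vector == null) {
      throw new IllegalArgumentException("term vector not stored for doc " + doc);
    }
    return vector;
  }

  /** Explicitly recomputes; expensive, but works without stored vectors. */
  int[] computeTermFreqVector(int doc, String text) {
    // stand-in for re-analyzing the document text: count whitespace tokens
    int[] freqs = new int[1];
    freqs[0] = text.split("\\s+").length;
    return freqs;
  }
}
```

The point is that the caller chooses the expensive path explicitly by name, instead of get() silently degrading.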

>
> I'm not sure if that is crazy, but that is what is currently implemented.
>
> -Mike





Re: Resolving term vector even when not stored?

2007-03-16 Thread Mike Klaas

On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:

I propose a change to the current IndexReader.getTermFreqVector/s
code so that it /always/ returns the vector space model of a document,
even when fields are set to Field.TermVector.NO.

Is that crazy? Could be really slow, but except for that.. And if it
is cached then that information is known by inspecting the fields.
People don't go fetching term vectors without knowing what they are
doing, do they?


The highlighting contrib code does this: attempt to retrieve the
termvector, catch InvalidArgumentException, fall back to re-analysis
of the data.

I'm not sure if that is crazy, but that is what is currently implemented.
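Stripped of the Lucene specifics, that attempt-then-fall-back pattern looks roughly like this (hypothetical names; the real contrib highlighter works on TermFreqVectors and falls back to re-running the analyzer):

```java
import java.util.HashMap;
import java.util.Map;

/** Try the stored vector first; on failure, fall back to recomputation. */
class FallbackVectorSource {

  private final Map<Integer, String> storedVectors = new HashMap<Integer, String>();

  void store(int doc, String vector) {
    storedVectors.put(doc, vector);
  }

  private String getStored(int doc) {
    String v = storedVectors.get(doc);
    if (v == null) {
      throw new IllegalArgumentException("no term vector stored for doc " + doc);
    }
    return v;
  }

  private String reanalyze(int doc) {
    // stand-in for re-running the analyzer over the document text
    return "recomputed:" + doc;
  }

  /** The highlighter-style pattern: attempt retrieval, catch, re-analyze. */
  String vectorFor(int doc) {
    try {
      return getStored(doc);
    } catch (IllegalArgumentException e) {
      return reanalyze(doc);
    }
  }
}
```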

-Mike




Resolving term vector even when not stored?

2007-03-15 Thread karl wettin
I propose a change to the current IndexReader.getTermFreqVector/s
code so that it /always/ returns the vector space model of a document,
even when fields are set to Field.TermVector.NO.

Is that crazy? Could be really slow, but except for that.. And if it
is cached then that information is known by inspecting the fields.
People don't go fetching term vectors without knowing what they are
doing, do they?


Whipped something up that builds the data using TermEnum and
TermPositions. Very simple. Does the job. Some IndexReader
implementations can do the job considerably faster than navigating
that way.


Used this in order to build a tool that can copy one index to another
using an IndexReader source and an IndexWriter target (and thus
transparently allows for converting one index to another, e.g. load an
FSDirectory into an InstantiatedIndex and vice versa). Sort of like
IndexWriter.addIndexes, but for any implementation.




Supersimple code: (Don't pay too much attention to it being a Map,
it should of course be a TermFreqVector.)


package org.apache.lucene.index;

import java.io.IOException;
import java.util.*;

/**
 * Resolves a term frequency vector from the inverted term index.
 * Maps field name to its term metas, ordered by term.
 *
 * @author Karl Wettin
 */
public class DocumentVectorSpaceModel
    extends HashMap<String, List<DocumentVectorSpaceModel.DocumentTermMeta>> {

  public DocumentVectorSpaceModel(int doc, IndexReader ir) throws IOException {

    TermEnum termEnum = ir.terms();
    while (termEnum.next()) {
      TermPositions termPositions = ir.termPositions(termEnum.term());
      // skipTo() positions at the first document >= doc, so verify it
      // actually is this document before reading positions.
      if (termPositions.skipTo(doc) && termPositions.doc() == doc) {
        int[] positions = new int[termPositions.freq()];
        for (int i = 0; i < positions.length; i++) {
          positions[i] = termPositions.nextPosition();
        }
        DocumentTermMeta meta = new DocumentTermMeta(termEnum.term(), positions);

        List<DocumentTermMeta> termMetas = this.get(termEnum.term().field());
        if (termMetas == null) {
          termMetas = new ArrayList<DocumentTermMeta>();
          this.put(termEnum.term().field(), termMetas);
        }
        // binarySearch returns -(insertion point) - 1 for an absent element
        int pos = (Collections.binarySearch(termMetas, meta) * -1) - 1;
        termMetas.add(pos, meta);
      }
    }
  }

  public static class DocumentTermMeta implements Comparable<DocumentTermMeta> {

    private Term term;
    private int[] termPositions;

    public DocumentTermMeta(Term term, int[] termPositions) {
      this.term = term;
      this.termPositions = termPositions;
    }

    public int compareTo(DocumentTermMeta documentTermMeta) {
      return getTerm().compareTo(documentTermMeta.getTerm());
    }

    public Term getTerm() {
      return term;
    }

    public int[] getTermPositions() {
      return termPositions;
    }
  }
}
Bonus, my import/export code that uses the code above to tokenize a
whole index and send it to a writer:



package org.apache.lucene.index;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.IOException;
import java.io.Reader;
import java.util.*;

/**
 * @author Karl Wettin
 */
public class IndexAppender {

  // private static Log log = LogFactory.getLog(IndexReplicator.class);
  // private static long serialVersionUID = 1l;

  /**
   * Adds the complete content of any one index (using an index reader)
   * to any other index (using an index writer).
   *
   * The analyzer creates one complete token stream of all fields with
   * the same name the first time it is requested, and an empty stream
   * for each remaining request. todo: is this a problem?
   *
   * It can be buggy if the same token appears as a synonym to itself
   * (position increment 0). Not really something to worry about.. or?
   *
   * @param sourceReader the index from which all content will be copied.
   * @param targetWriter the index to which content will be copied.
   * @throws java.io.IOException when accessing source or target.
   */
  public static void append(final IndexReader sourceReader,
                            IndexWriterInterface targetWriter) throws IOException {

    for (int documentNumber = 0; documentNumber < sourceReader.maxDoc(); documentNumber++) {

      final int documentNumberInnerAccessHack = documentNumber;
      final Document document = sourceReader.document(documentNumber);
      final DocumentVectorSpaceModel documentVectorSpaceModel =
          new DocumentVectorSpaceModel(documentNumber, sourceReader);

      targetWriter.addDocument(document, new Analyzer() {

        private Set processedFields = new HashSet();

        public TokenStream tokenStream(final String fieldName, Reader reader) {

          if (!processedFields.add(fieldName)) {
            // Field already tokenized once; return an empty token stream.
            return new TokenStream() {

              public Token next() throws IOException {
                return null;