Hi, Frank!
Sorry for being silent. I'll try to run the unit tests as soon as I have
some free time for this (for now we use a workaround for this problem in
our solution).
I saw that you started on the CLI program. Are you going to let the user
of the CLI constrain just the "columns" (fields), or also the "rows"
(values) in the index?
On 02/01/2012 11:51 AM, Frank Scholten wrote:
Hi Michael,
Can you compare what you are doing with the code from the
testRun_query() unit test in LuceneIndexToSequenceFilesTest? The unit
test works, so I am curious where there is a difference.
Cheers,
Frank
On Fri, Jan 27, 2012 at 4:41 PM, Michael Kazekin
<[email protected]> wrote:
Frank, I tried this code with a Solr 3.5 index (and changed all the
dependencies in the pom file), but it still doesn't work:
Directory directory = FSDirectory.open(file);
IndexReader reader = IndexReader.open(directory, true);
IndexSearcher searcher = new IndexSearcher(reader);
I try to get a Scorer with this TermQuery (the "lang" field is indexed
and stored, and all the data is available):
TermQuery atomQuery = new TermQuery(new Term("lang", "ru"));
Weight weight = atomQuery.createWeight(searcher);
Scorer scorer = weight.scorer(reader, true, false);
// scorer == null here
On 01/25/2012 07:04 PM, Frank Scholten wrote:
Are you using Lucene 3.4? I had this problem as well and I believe
this was because of https://issues.apache.org/jira/browse/LUCENE-3442
which is fixed in Lucene 3.5.
On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin
<[email protected]> wrote:
Frank, I tried to use a BooleanQuery composed of several TermQueries
(these represent key:value constraints, where the key is a field name,
for example "lang:en"), but the Scorer created by the Weight in your
code is null. Do you know what could be wrong here?
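In case it matters, I build the constrained query roughly like this (the
field names and values below are only examples):

// Example of combining key:value constraints into one BooleanQuery
// (Lucene 3.x API); "lang"/"en" and "type"/"article" are made-up pairs.
BooleanQuery constraints = new BooleanQuery();
constraints.add(new TermQuery(new Term("lang", "en")), BooleanClause.Occur.MUST);
constraints.add(new TermQuery(new Term("type", "article")), BooleanClause.Occur.MUST);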
Sorry to bother you on the dev list with such questions, but I am trying
to make a CLI util for this code, so I think it would be helpful for
everybody.
Great! Let me know if you need more help.
Cheers,
Frank
On 01/20/2012 02:15 AM, Frank Scholten wrote:
LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();

Configuration configuration = ... ;
IndexDirectory indexDirectory = ... ;
Path seqPath = ... ;
String idField = ... ;
String field = ... ;
List<String> extraFields = asList( ... );
Query query = ... ;

LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
    new LuceneIndexToSequenceFilesConfiguration(configuration,
        indexDirectory.getFile(), seqPath, idField, field);
lucene2SeqConf.setExtraFields(extraFields);
lucene2SeqConf.setQuery(query);

lucene2Seq.run(lucene2SeqConf);
The seqPath variable can be passed into seq2sparse.
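For example, something along these lines should work (a sketch only: I
assume the seq2sparse driver class is
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles and that its
main() takes the same -i/-o arguments as "mahout seq2sparse" does on the
command line; double-check both):

// Sketch: feed the sequence files written above into seq2sparse.
SparseVectorsFromSequenceFiles.main(new String[] {
    "-i", seqPath.toString(),      // the sequence files produced by lucene2Seq.run(...)
    "-o", "/path/to/vector/output" // where the sparse vectors should be written
});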
Cheers,
Frank
On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
<[email protected]> wrote:
Frank, could you please tell me how to use your lucene2seq tool?
On 01/18/2012 04:57 PM, Frank Scholten wrote:
You can use a MatchAllDocsQuery if you want to fetch all documents.
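In other words:

// No field constraints: select every document in the index.
Query query = new MatchAllDocsQuery();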
On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
<[email protected]> wrote:
Thank you, Frank! I'll definitely have a look at it.
As far as I can see, the problem with using Lucene for clustering tasks
is that even with queries you only get access to the "tip of the
iceberg" of the results, while clustering tasks need to deal with the
result set as a whole.
On 01/17/2012 09:56 PM, Frank Scholten wrote:
Hi Michael,
Check out https://issues.apache.org/jira/browse/MAHOUT-944
This is a lucene2seq tool. You can pass in fields and a Lucene query and
it generates text sequence files.
From there you can use seq2sparse.
Cheers,
Frank
Sorry for brevity, sent from phone
On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
Hi!
I am trying to extend the "mahout lucene.vector" driver so that it can
be fed arbitrary key-value constraints on Solr schema fields (and
generate Mahout vectors for only a subset of the index, which seems to
be a common use case).
The best (easiest) way I see is to create an IndexReader implementation
that would allow reading just that subset.
The problem is that I don't know the correct way to do this.
Maybe subclassing FilterIndexReader would solve the problem, but I don't
know which methods to override to get a consistent object
representation.
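Roughly, the skeleton I have in mind is something like this (only a
sketch: the class name is made up, and I am not at all sure these
overrides are enough to keep the reader consistent):

// Sketch: a FilterIndexReader that hides documents outside an allowed set.
// 'allowedDocs' would be filled from the key:value constraints, e.g. by
// running the constraint query once and collecting the matching doc ids.
public class SubsetIndexReader extends FilterIndexReader {

  private final OpenBitSet allowedDocs; // doc ids that satisfy the constraints

  public SubsetIndexReader(IndexReader in, OpenBitSet allowedDocs) {
    super(in);
    this.allowedDocs = allowedDocs;
  }

  @Override
  public int numDocs() {
    return (int) allowedDocs.cardinality();
  }

  @Override
  public boolean hasDeletions() {
    return true; // treat filtered-out documents as if they were deleted
  }

  @Override
  public boolean isDeleted(int docId) {
    return !allowedDocs.get(docId) || in.isDeleted(docId);
  }
}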
The driver code includes the following:
IndexReader reader = IndexReader.open(dir, true);

Weight weight;
if ("tf".equalsIgnoreCase(weightType)) {
  weight = new TF();
} else if ("tfidf".equalsIgnoreCase(weightType)) {
  weight = new TFIDF();
} else {
  throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
}

TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);

LuceneIterable iterable;
if (norm == LuceneIterable.NO_NORMALIZING) {
  iterable = new LuceneIterable(reader, idField, field, mapper,
      LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
} else {
  iterable = new LuceneIterable(reader, idField, field, mapper, norm,
      maxPercentErrorDocs);
}
It then creates a SequenceFile.Writer and writes out the "iterable"
variable.
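The simplest injection point I can think of is to wrap the reader right
after it is opened, along these lines (again only a sketch, using the
hypothetical SubsetIndexReader from above and a hypothetical helper that
collects the matching doc ids):

// Sketch: everything downstream (CachedTermInfo, TFDFMapper, LuceneIterable)
// would then only see documents that match the key:value constraints.
IndexReader rawReader = IndexReader.open(dir, true);
OpenBitSet allowedDocs = collectMatchingDocs(rawReader, constraintQuery); // hypothetical helper
IndexReader reader = new SubsetIndexReader(rawReader, allowedDocs);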
Do you have any thoughts on the simplest way to inject this code?