Frank, I tried to use a BooleanQuery consisting of several TermQueries
(these represent key:value constraints, where the key is the field name, for
example "lang:en"),
but the Scorer created by the Weight in your code is null. Do you know
what could be wrong here?
Sorry to bother you on the dev list with such questions, but I am trying to
make a CLI util for this code, so I think it would be helpful for
everybody.
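For the CLI util, here is a minimal sketch of how the key:value arguments might be parsed before they are turned into queries. ConstraintParser and its parse method are hypothetical names for illustration only; they are not part of Mahout or Lucene. The sketch assumes each constraint arrives as a single "field:value" string and splits at the first colon, so a value may itself contain colons.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout or Lucene.
class ConstraintParser {

  // Splits each "field:value" argument at the first colon and keeps the
  // constraints in the order they were given.
  static Map<String, String> parse(List<String> args) {
    Map<String, String> constraints = new LinkedHashMap<String, String>();
    for (String arg : args) {
      int sep = arg.indexOf(':');
      // Reject arguments with no colon, an empty field name, or an empty value.
      if (sep <= 0 || sep == arg.length() - 1) {
        throw new IllegalArgumentException("Expected field:value but got '" + arg + "'");
      }
      constraints.put(arg.substring(0, sep), arg.substring(sep + 1));
    }
    return constraints;
  }

  public static void main(String[] args) {
    System.out.println(parse(Arrays.asList("lang:en", "type:news")));
  }
}
```

Each entry of the resulting map could then be wrapped in a TermQuery and added to the BooleanQuery as a required clause.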
On 01/20/2012 02:15 AM, Frank Scholten wrote:
LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
Configuration configuration = ... ;
IndexDirectory indexDirectory = ... ;
Path seqPath = ... ;
String idField = ... ;
String field = ... ;
List<String> extraFields = asList( ... );
Query query = ... ;
LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
    new LuceneIndexToSequenceFilesConfiguration(configuration,
        indexDirectory.getFile(), seqPath, idField, field);
lucene2SeqConf.setExtraFields(extraFields);
lucene2SeqConf.setQuery(query);
lucene2Seq.run(lucene2SeqConf);
The seqPath variable can be passed into seq2sparse.
Cheers,
Frank
On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
<[email protected]> wrote:
Frank, could you please tell me how to use your lucene2seq tool?
On 01/18/2012 04:57 PM, Frank Scholten wrote:
You can use a MatchAllDocsQuery if you want to fetch all documents.
On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
<[email protected]> wrote:
Thank you, Frank! I'll definitely have a look at it.
As far as I can see, the problem with using Lucene in clustering tasks
is that even with queries you get access to only the "tip-of-the-iceberg"
results, while clustering tasks need to deal with the results as a whole.
On 01/17/2012 09:56 PM, Frank Scholten wrote:
Hi Michael,
Check out https://issues.apache.org/jira/browse/MAHOUT-944
This is a lucene2seq tool. You can pass in fields and a Lucene query, and
it generates text sequence files.
From there you can use seq2sparse.
Cheers,
Frank
Sorry for brevity, sent from phone
On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
Hi!
I am trying to extend the "mahout lucene.vector" driver so that it can be
fed arbitrary key-value constraints on Solr schema fields (and generate
Mahout vectors for only a subset of the index, which seems to be a regular
use case).
The best (easiest) way I see is to create an IndexReader implementation
that would allow reading only the subset.
The problem is that I don't know the correct way to do this.
Maybe subclassing FilterIndexReader would solve the problem, but I
don't know which methods to override to get a consistent object
representation.
The driver code includes the following:
IndexReader reader = IndexReader.open(dir, true);
Weight weight;
if ("tf".equalsIgnoreCase(weightType)) {
  weight = new TF();
} else if ("tfidf".equalsIgnoreCase(weightType)) {
  weight = new TFIDF();
} else {
  throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
}
TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
LuceneIterable iterable;
if (norm == LuceneIterable.NO_NORMALIZING) {
  iterable = new LuceneIterable(reader, idField, field, mapper,
      LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
} else {
  iterable = new LuceneIterable(reader, idField, field, mapper,
      norm, maxPercentErrorDocs);
}
It then creates a SequenceFile.Writer and writes out the contents of the
"iterable" variable.
Do you have any thoughts on the simplest way to inject this code?