[ https://issues.apache.org/jira/browse/MAHOUT-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi updated MAHOUT-1292:
----------------------------------
    Summary: lucene2seq should validate the 'id' field  (was: lucene2seq creates single document from index)

> lucene2seq should validate the 'id' field
> -----------------------------------------
>
>                 Key: MAHOUT-1292
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1292
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.8
>            Reporter: Liz Merkhofer
>            Assignee: Suneel Marthi
>              Labels: cvb, lucene, solr
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1292.patch
>
>
> Lucene2seq creates only one sequence file, rather than a file for each
> document in the index.
> Running lucene2seq on my Solr (4.3) index produces a file with a header and,
> it seems, the field I specified from the index, concatenated for all the
> documents. After running this through seq2sparse and rowid (to prepare for
> cvb), the resulting matrix has only one row, though it should have one row
> per document.
> This issue prevents, at least, data from a Lucene index from being easily
> used as input for cvb. Lucene.vector is also currently inadequate: the keys
> of its sequence files are LongWritable, and rowid will convert only Text
> keys to IntWritable, as is necessary for the keys in cvb.

--
This message was sent by Atlassian JIRA
(v6.1#6144)