Dear Ali,

I'm not sure I understand what you are trying to do. Please correct me if I misunderstood: given a document indexed into Lucene, you want to retrieve the top-k terms with the highest tf-idf, right?
Could you please post your code somewhere? I don't understand what "mlt" is :)

Cheers,
Diego

On Fri, Jul 17, 2015 at 8:28 AM, Ali Nazemian <alinazem...@gmail.com> wrote:

> Dear Lucene/Solr developers,
> Hi,
> I decided to develop a plugin for Solr in order to extract the main keywords
> from an article. Since Solr already does the hard work of calculating
> tf-idf scores, I decided to use them for the sake of better performance. I
> know that UpdateRequestProcessor is the best-suited extension point for
> adding a keyword value to documents. I also found out that I do not have any
> access to tf-idf scores inside the UpdateRequestProcessor, because the
> UpdateRequestProcessor chain is applied before the tf-idf scores are
> calculated. Hence, after consulting with Solr/Lucene developers, I decided
> to go for a SearchComponent in order to calculate keywords based on tf-idf
> (Lucene Interesting Terms) on commit/optimize.
> Unfortunately, with this approach I observed strange core behavior. For
> example, sometimes faceting won't work on this keyword field, or the index
> becomes unstable in search results.
> I would really appreciate it if someone could help me make it stable.
>
>
> NamedList response = new SimpleOrderedMap();
> keyword.init(searcher, params);
> // Select documents whose keyword field still holds the "noval" placeholder.
> BooleanQuery query = new BooleanQuery();
> for (String fieldName : keywordSourceFields) {
>   TermQuery termQuery = new TermQuery(new Term(fieldName, "noval"));
>   query.add(termQuery, Occur.MUST_NOT);
> }
> TermQuery termQuery = new TermQuery(new Term(keywordField, "noval"));
> query.add(termQuery, Occur.MUST);
> RefCounted<IndexWriter> iw = null;
> IndexWriter writer = null;
> try {
>   TopDocs results = searcher.search(query, maxNumDocs);
>   ScoreDoc[] hits = results.scoreDocs;
>   iw = solrCoreState.getIndexWriter(core);
>   writer = iw.get();
>   FieldType type = new FieldType(StringField.TYPE_STORED);
>   for (int i = 0; i < hits.length; i++) {
>     Document document = searcher.doc(hits[i].doc);
>     List<String> keywords = keyword.getKeywords(hits[i].doc);
>     if (keywords.size() > 0) document.removeFields(keywordField);
>     for (String word : keywords) {
>       document.add(new Field(keywordField, word, type));
>     }
>     String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
>     // Rewrite the document in place, keyed by its unique key.
>     writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
>   }
>   response.add("Number of Selected Docs", results.totalHits);
>   writer.commit();
> } catch (IOException | SyntaxError e) {
>   throw new RuntimeException(e);
> } finally {
>   if (iw != null) {
>     iw.decref();
>   }
> }
>
>
> public List<String> getKeywords(int docId) throws SyntaxError {
>   String[] fields = new String[keywordSourceFields.size()];
>   List<String> terms = new ArrayList<String>();
>   fields = keywordSourceFields.toArray(fields);
>   mlt.setFieldNames(fields);
>   mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
>   mlt.setMinTermFreq(minTermFreq);
>   mlt.setMinDocFreq(minDocFreq);
>   mlt.setMinWordLen(minWordLen);
>   mlt.setMaxQueryTerms(maxNumKeywords);
>   mlt.setMaxNumTokensParsed(maxTokensParsed);
>   try {
>     terms = Arrays.asList(mlt.retrieveInterestingTerms(docId));
>   } catch (IOException e) {
>     LOGGER.error(e.getMessage());
>     throw new RuntimeException(e);
>   }
>
>   return terms;
> }
>
> Best regards.
> --
> A.Nazemian
>
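
To pin down the question at the top of this reply (top-k terms of one document ranked by tf-idf), here is a minimal sketch in plain Java with no Lucene or Solr dependency. Everything in it is an assumption made for illustration: the class name `TopTfIdfTerms`, the method `topTerms`, the toy corpus, and the weighting tf × ln(N/df). That weighting is the textbook form, not necessarily what Lucene computes internally, so treat it only as a statement of the expected behavior, not as a replacement for the code quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Solr.
public class TopTfIdfTerms {

    // Returns up to k terms of corpus.get(docIndex), highest tf-idf first.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<String> topTerms(List<List<String>> corpus, int docIndex, int k) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency within the selected document.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : corpus.get(docIndex)) {
            termFreq.merge(term, 1, Integer::sum);
        }

        // Assumed weighting: tf * ln(N / df).
        int numDocs = corpus.size();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }

        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        // Toy, already-tokenized corpus (made up for this example).
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("solr", "keyword", "keyword", "plugin"),
                Arrays.asList("solr", "facet", "index"),
                Arrays.asList("solr", "index", "commit"));
        System.out.println(topTerms(corpus, 0, 2)); // prints [keyword, plugin]
    }
}
```

In the toy corpus, "solr" appears in every document, so its idf is ln(3/3) = 0 and it drops out of the ranking. If this matches what the plugin is meant to store in the keyword field, then the remaining question is only how and when to write those terms back to the index.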