Re: Extracting article keywords using tf-idf algorithm

Diego Ceccarelli Sat, 18 Jul 2015 08:18:00 -0700

Dear Ali,

I'm not sure I understand what you are trying to do, so please correct me if
I misunderstood: given a document indexed in Lucene, you want to retrieve the
top-k terms with the highest tf-idf, right?
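
For illustration, that ranking can be sketched in plain Java, independent of
Lucene and Solr. The class name TfIdfSketch, the method topTerms, and the idf
variant 1 + ln(N / (df + 1)) are assumptions made for this example only; they
are not taken from the plugin code quoted below, and other idf formulas
exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain-Java tf-idf ranking, independent of Lucene and Solr. */
class TfIdfSketch {

  /**
   * Returns up to k terms of corpus.get(docIndex), ordered by descending tf-idf.
   * Each document is a list of already-analyzed tokens.
   */
  static List<String> topTerms(List<List<String>> corpus, int docIndex, int k) {
    // Document frequency: in how many documents does each term occur?
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : corpus) {
      for (String term : new HashSet<>(doc)) {
        docFreq.merge(term, 1, Integer::sum);
      }
    }

    // Term frequency inside the target document.
    Map<String, Integer> termFreq = new HashMap<>();
    for (String term : corpus.get(docIndex)) {
      termFreq.merge(term, 1, Integer::sum);
    }

    // Score = tf * idf, with idf = 1 + ln(N / (df + 1)) (an assumed variant).
    int numDocs = corpus.size();
    Map<String, Double> score = new HashMap<>();
    for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
      double idf = 1.0 + Math.log((double) numDocs / (docFreq.get(e.getKey()) + 1));
      score.put(e.getKey(), e.getValue() * idf);
    }

    // Sort by descending score, breaking ties alphabetically for determinism.
    List<String> ranked = new ArrayList<>(score.keySet());
    ranked.sort((a, b) -> {
      int byScore = Double.compare(score.get(b), score.get(a));
      return byScore != 0 ? byScore : a.compareTo(b);
    });
    return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
  }

  public static void main(String[] args) {
    List<List<String>> corpus = Arrays.asList(
        Arrays.asList("solr", "keyword", "keyword", "plugin"),
        Arrays.asList("solr", "facet", "search"),
        Arrays.asList("solr", "index", "search"));
    // Top two terms of the first document.
    System.out.println(topTerms(corpus, 0, 2)); // prints [keyword, plugin]
  }
}
```

In this toy corpus "solr" occurs in every document, so its idf is low and it
drops out of the top two even though it appears in the first document.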


Could you please post your code somewhere? I don't understand what "mlt"
is :)

Cheers,
Diego


On Fri, Jul 17, 2015 at 8:28 AM, Ali Nazemian <alinazem...@gmail.com> wrote:

> Dear Lucene/Solr developers,
> Hi,
> I decided to develop a plugin for Solr to extract the main keywords from an
> article. Since Solr already does the hard work of calculating tf-idf
> scores, I decided to reuse them for the sake of better performance. I know
> that UpdateRequestProcessor is the best-suited extension point for adding a
> keyword value to documents. I also found out that I have no access to
> tf-idf scores inside the UpdateRequestProcessor, because the
> UpdateRequestProcessor chain is applied before the tf-idf scores are
> calculated. Hence, after consulting with Solr/Lucene developers, I decided
> to use a SearchComponent to calculate keywords based on tf-idf (Lucene
> Interesting Terms) on commit/optimize.
> Unfortunately, with this approach I observed strange core behavior. For
> example, sometimes faceting won't work on this keyword field, or the index
> becomes unstable in search results.
> I would really appreciate it if someone could help me make it stable.
>
>
>     NamedList<Object> response = new SimpleOrderedMap<>();
>     keyword.init(searcher, params);
>     // Select documents whose keyword field still holds the placeholder
>     // "noval" and whose source fields do not.
>     BooleanQuery query = new BooleanQuery();
>     for (String fieldName : keywordSourceFields) {
>       TermQuery termQuery = new TermQuery(new Term(fieldName, "noval"));
>       query.add(termQuery, Occur.MUST_NOT);
>     }
>     TermQuery termQuery = new TermQuery(new Term(keywordField, "noval"));
>     query.add(termQuery, Occur.MUST);
>     RefCounted<IndexWriter> iw = null;
>     IndexWriter writer = null;
>     try {
>       TopDocs results = searcher.search(query, maxNumDocs);
>       ScoreDoc[] hits = results.scoreDocs;
>       iw = solrCoreState.getIndexWriter(core);
>       writer = iw.get();
>       FieldType type = new FieldType(StringField.TYPE_STORED);
>       String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
>       for (int i = 0; i < hits.length; i++) {
>         // Note: searcher.doc() returns stored fields only, so non-stored
>         // fields are not carried over when the document is rewritten.
>         Document document = searcher.doc(hits[i].doc);
>         List<String> keywords = keyword.getKeywords(hits[i].doc);
>         if (keywords.size() > 0) document.removeFields(keywordField);
>         for (String word : keywords) {
>           document.add(new Field(keywordField, word, type));
>         }
>         // Replace the indexed document that has the same unique key.
>         writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)),
>             document);
>       }
>       response.add("Number of Selected Docs", results.totalHits);
>       writer.commit();
>     } catch (IOException | SyntaxError e) {
>       // Keep the original exception as the cause instead of discarding it.
>       throw new RuntimeException(e);
>     } finally {
>       if (iw != null) {
>         iw.decref();
>       }
>     }
>
>
>     public List<String> getKeywords(int docId) throws SyntaxError {
>       String[] fields = keywordSourceFields.toArray(new String[0]);
>       mlt.setFieldNames(fields);
>       mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
>       // Thresholds that decide which terms qualify as keywords.
>       mlt.setMinTermFreq(minTermFreq);
>       mlt.setMinDocFreq(minDocFreq);
>       mlt.setMinWordLen(minWordLen);
>       mlt.setMaxQueryTerms(maxNumKeywords);
>       mlt.setMaxNumTokensParsed(maxTokensParsed);
>       try {
>         return Arrays.asList(mlt.retrieveInterestingTerms(docId));
>       } catch (IOException e) {
>         LOGGER.error(e.getMessage(), e);
>         // Keep the original exception as the cause instead of discarding it.
>         throw new RuntimeException(e);
>       }
>     }
>
> Best regards.
> --
> A.Nazemian
>
