Hi all,

We have over 6 million documents in our index and would like to construct a 
term frequency matrix over all of them as quickly as possible.  Each document 
has a numeric date field, so for a given term we want to build a time series 
whose values are the sums of that term's frequencies over all documents sharing 
a date.  For example, if the term were "iPhone", we would want a time series 
containing the total number of "iPhone" mentions across all documents, 
decomposed into time buckets by date.
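
To make the target shape concrete, here is a minimal sketch of the structure we 
are after for a single term (illustrative only, not our actual matrix class): a 
map from date bucket to summed term frequency.

import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: the per-term time series we want, keyed by the numeric date field. */
public class TermTimeSeries {

    // date bucket -> total occurrences of the term across all documents with that date
    private final Map<Integer, Long> countsByDate = new TreeMap<Integer, Long>();

    public void add(int date, int freq) {
        final Long current = countsByDate.get(date);
        countsByDate.put(date, current == null ? (long) freq : current + freq);
    }

    public Map<Integer, Long> counts() {
        return countsByDate;
    }
}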

The approach we have tried is to write a custom Collector, shown below, but it 
is really, really slow.  Is there a different way of approaching this that 
would make it perform much better?

@Override
public void collect(int docId) throws IOException {
    // docId is relative to the segment reader set in setNextReader().
    try {
      ++collectCount;
      if (reader != null) {
          final Terms terms = reader.getTermVector(docId, field);
          if (terms == null) {
              return; // this document has no term vector for the field
          }
          termsEnum = terms.iterator(termsEnum);
          final int colIndex = matrix.columns().add(term);
          if (termsEnum.seekExact(termRef)) {
            // Only frequencies are needed, so a plain DocsEnum is sufficient here.
            final DocsEnum docsEnum = termsEnum.docs(null, null, DocsEnum.FLAG_FREQS);
            while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                final int date = dates.get(docId);
                final int freq = docsEnum.freq();
                final int rowIndex = matrix.rows().add(date);
                final double value = matrix.getDouble(rowIndex, colIndex);
                matrix.setDouble(rowIndex, colIndex,
                        Double.isNaN(value) ? freq : value + freq);
                if (++docCount % 1000 == 0) {
                  LOG.info("Processed " + docCount + " / " + collectCount
                          + " documents in term frequency analysis...");
                }
            }
          }
      }
    } catch (Throwable t) {
      throw new RuntimeException("Failed to collect document " + docId, t);
    }
}

@Override
public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
    this.reader = atomicReaderContext.reader();
    // Per-segment cache of the numeric "date" field, indexed by segment-local docId.
    this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
}
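
For context, this is roughly how we drive the collector (a simplified sketch; 
TermFrequencyCollector stands in for our actual Collector class, and the real 
query setup may differ):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class TermFrequencyDriver {
    public void run(Directory dir, String field, String term) throws Exception {
        final DirectoryReader reader = DirectoryReader.open(dir);
        try {
            final IndexSearcher searcher = new IndexSearcher(reader);
            // TermFrequencyCollector is our custom Collector from above (name illustrative).
            final TermFrequencyCollector collector = new TermFrequencyCollector(field, term);
            // Restrict to documents containing the term; collect() then reads each doc's
            // frequency from its term vector and adds it to the appropriate date bucket.
            searcher.search(new TermQuery(new Term(field, term)), collector);
        } finally {
            reader.close();
        }
    }
}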

Any help would be much appreciated...

Thanks,
Zav
