Hi all,

We have over 6 million documents in our index and would like to construct a term frequency matrix over all of them as quickly as possible. Each document has a numeric date field, so we would like to build a time series in which each value is the sum of the term's frequencies across all documents with that date. For example, if the term were "iPhone", we would want a time series containing the total number of iPhone mentions across all documents, broken down into date buckets.
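To make the target output concrete, here is a minimal sketch of the per-date aggregation in plain Java, with no Lucene involved. It assumes the per-document (date, frequency) pairs for a term are already available and that dates are encoded as ints such as 20130114; that encoding and the TermTimeSeries class and its methods are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Lucene): sums one term's frequencies per date bucket.
final class TermTimeSeries {
    // date (assumed int encoding, e.g. 20130114) -> summed frequency, sorted by date
    private final TreeMap<Integer, Long> sums = new TreeMap<>();

    // Add one document's frequency for the term to the bucket for its date.
    void add(int date, int freq) {
        sums.merge(date, (long) freq, Long::sum);
    }

    // Summed frequency for a date, or 0 if no document on that date was added.
    long get(int date) {
        return sums.getOrDefault(date, 0L);
    }

    // Read-only view of the whole series, ordered by date.
    Map<Integer, Long> asMap() {
        return Collections.unmodifiableMap(sums);
    }

    public static void main(String[] args) {
        TermTimeSeries iphone = new TermTimeSeries();
        iphone.add(20130114, 2); // document A mentions the term twice
        iphone.add(20130114, 1); // document B, same date, once
        iphone.add(20130115, 4); // document C, next date, four times
        System.out.println(iphone.asMap()); // prints {20130114=3, 20130115=4}
    }
}
```

One series like this per term corresponds to one column of the matrix we are after, with the dates as rows.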
The approach we have tried is to write a custom Collector as below, but this seems really, really slow. Is there any way of approaching this differently to make it perform much better?

    @Override()
    public void collect(int docId) throws IOException {
        try {
            ++collectCount;
            if (reader != null) {
                final Terms terms = reader.getTermVector(docId, field);
                termsEnum = terms.iterator(termsEnum);
                final int colIndex = matrix.columns().add(term);
                if (termsEnum.seekExact(termRef)) {
                    final DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
                    while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        final int date = dates.get(docId);
                        final int freq = docsAndPositionsEnum.freq();
                        final int rowIndex = matrix.rows().add(date);
                        final double value = matrix.getDouble(rowIndex, colIndex);
                        matrix.setDouble(rowIndex, colIndex, Double.isNaN(value) ? freq : value + freq);
                        if (++docCount % 1000 == 0) {
                            LOG.info("Processed " + docCount + " / " + collectCount + " documents in term frequency analysis...");
                        }
                    }
                }
            }
        } catch (Throwable t) {
            throw new RuntimeException("Failed to collect document " + docId, t);
        }
    }

    @Override()
    public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
        this.reader = atomicReaderContext.reader();
        this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
    }

Any help would be much appreciated...

Thanks,
Zav