[ https://issues.apache.org/jira/browse/SOLR-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862985#comment-13862985 ]
Joel Bernstein commented on SOLR-5244:
--------------------------------------

Mikhail,

For my test I was extracting a single string field, using in-memory docValues. Because docValues are column oriented, as you mentioned, each additional column lookup will be an additional hit on performance.

I'm seeing two possible approaches to this:

1) Add a special cache that speeds up the docId->BytesRef lookup. This would be a segment-level cache of the top N terms (by frequency) in the index. The cache would be a simple int-to-BytesRef hashmap, mapping the segment-level ord to the BytesRef. This cache would be much faster than the binaryDocValues docId->BytesRef lookup, so with a decent cache hit rate, performance could improve dramatically. This approach would improve performance while keeping the fields separate, so you could pick and choose what to export.

2) Only export a single field. With this approach you would have one docValues field that holds the entire extract record. You could use JSON or a binary format to structure this field any way you want. With this approach caches wouldn't help, but you'd eliminate the penalty for looking up data in multiple columns. I'm leaning towards this approach.

With either approach, threading could be used to increase throughput. You could have a thread per segment extracting records and adding them to a queue, and a single thread pulling from the queue and streaming the data out.

You're right, 5 million is not going to happen with the network limitations. The goal could then be to export data as fast as the network can send it out. You could throttle this by having fewer threads extracting records from the segments.
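The threading idea above can be sketched roughly as follows. This is only an illustration using plain java.util.concurrent, not code from the patch: the class name SegmentExportSketch, the export method, and the use of in-memory string lists to stand in for segments are all assumptions made for the example. A real version would read docValues per segment and write to the response stream instead of collecting into a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: extractor threads feed a bounded queue, one thread drains it.
public class SegmentExportSketch {

    // Unique marker object; compared by identity so it can never collide with a real record.
    private static final String END_OF_SEGMENT = new String("<end-of-segment>");

    /**
     * Each inner list stands in for the records of one index segment (an assumption
     * for this sketch). Returns the records in the order the single writer drained them.
     * Fewer extractorThreads means slower extraction, which is the throttle.
     */
    public static List<String> export(List<List<String>> segments,
                                      int extractorThreads,
                                      int queueCapacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService extractors = Executors.newFixedThreadPool(extractorThreads);
        try {
            for (List<String> segment : segments) {
                extractors.submit(() -> {
                    try {
                        for (String record : segment) {
                            queue.put(record); // blocks while the writer is behind
                        }
                        queue.put(END_OF_SEGMENT); // tell the writer this segment is done
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Single "streaming" thread: here it just collects what it would write out.
            List<String> streamed = new ArrayList<>();
            int finishedSegments = 0;
            while (finishedSegments < segments.size()) {
                String record = queue.take();
                if (record == END_OF_SEGMENT) {
                    finishedSegments++;
                } else {
                    streamed.add(record);
                }
            }
            return streamed;
        } finally {
            extractors.shutdownNow();
        }
    }
}
```

The bounded queue gives back-pressure: extractors stall on put() when the writer (in practice, the network) cannot keep up, so memory use stays capped. Records from one segment keep their order, but segments interleave arbitrarily in the output.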
Joel

> Full Search Result Export
> -------------------------
>
>                 Key: SOLR-5244
>                 URL: https://issues.apache.org/jira/browse/SOLR-5244
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 5.0
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: SOLR-5244.patch
>
>
> It would be great if Solr could efficiently export entire search result sets
> without scoring or ranking documents. This would allow external systems to
> perform rapid bulk imports from Solr. It also provides a possible platform
> for exporting results to support distributed join scenarios within Solr.
> This ticket provides a patch that has two pluggable components:
> 1) ExportQParserPlugin: a post filter that gathers a BitSet with
> document results and does not delegate to ranking collectors. Instead it puts
> the BitSet on the request context.
> 2) BinaryExportWriter: an output writer that iterates the BitSet and prints
> the entire result as a binary stream. A header is provided at the beginning
> of the stream so external clients can self-configure.
> Note:
> These two components will be sufficient for a non-distributed environment.
> For distributed export a new request handler will need to be developed.
> After applying the patch and building the dist or example, you can register
> the components through the following changes to solrconfig.xml
> Register export contrib libraries:
> <lib dir="../../../dist/" regex="solr-export-\d.*\.jar" />
>
> Register the "export" queryParser with the following line:
>
> <queryParser name="export"
> class="org.apache.solr.export.ExportQParserPlugin"/>
>
> Register the "xbin" writer:
>
> <queryResponseWriter name="xbin"
> class="org.apache.solr.export.BinaryExportWriter"/>
>
> The following query will perform the export:
> {code}
> http://localhost:8983/solr/collection1/select?q=*:*&fq={!export}&wt=xbin&fl=join_i
> {code}
> Initial patch supports export of four data-types:
> 1) Single value trie int, long and float
> 2) Binary doc values.
> The numerics are currently exported from the FieldCache and the Binary doc
> values can be in memory or on disk.
> Since this is designed to export very large result sets efficiently, stored
> fields are not used for the export.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)