On Sat, Nov 16, 2013 at 4:23 AM, Rafał Kuć (JIRA) <j...@apache.org> wrote:
> [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824426#comment-13824426 ]
>
> Rafał Kuć commented on SOLR-1301:
> ---------------------------------
>
> Mark, is the version attached to this issue the newest patch, or do you
> have something newer?
>
> > Add a Solr contrib that allows for building Solr indexes via Hadoop's
> > Map-Reduce.
> > ---------------------------------------------------------------------
> >
> >          Key: SOLR-1301
> >          URL: https://issues.apache.org/jira/browse/SOLR-1301
> >      Project: Solr
> >   Issue Type: New Feature
> >     Reporter: Andrzej Bialecki
> >     Assignee: Mark Miller
> >      Fix For: 4.6
> >
> >  Attachments: README.txt, SOLR-1301-hadoop-0-20.patch,
> > SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java,
> > commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
> > hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar,
> > hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar
> >
> > This patch contains a contrib module that provides distributed indexing
> > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module
> > is twofold:
> >
> > * provide an API that is familiar to Hadoop developers, i.e. that of
> >   OutputFormat
> > * avoid unnecessary export and (de)serialization of data maintained on
> >   HDFS. SolrOutputFormat consumes data produced by reduce tasks
> >   directly, without storing it in intermediate files. Furthermore, by
> >   using an EmbeddedSolrServer, the indexing task is split into as many
> >   parts as there are reducers, and the data to be indexed is not sent
> >   over the network.
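The "split into as many parts as there are reducers" behavior follows Hadoop's usual key partitioning: each key is hashed to one of N reducers, and each reducer builds its own shard locally. A minimal sketch of that mapping (the `ShardPartitioner` class is illustrative, mirroring Hadoop's default HashPartitioner rather than any class from this patch):

```java
public class ShardPartitioner {
    // Mirrors Hadoop's default HashPartitioner: every document id maps to
    // exactly one of numReducers shards, so each reducer indexes a
    // disjoint slice of the data and no documents cross the network.
    public static int shardFor(String docId, int numReducers) {
        // Mask off the sign bit so negative hashCodes still yield a
        // non-negative shard index.
        return (docId.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int shards = 4;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the mapping is deterministic, re-running a job with the same reducer count sends each document to the same shard.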
> > Design
> > ------
> >
> > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> > instantiates an EmbeddedSolrServer, along with an implementation of
> > SolrDocumentConverter, which is responsible for turning a Hadoop
> > (key, value) pair into a SolrInputDocument. This data is then added to a
> > batch, which is periodically submitted to the EmbeddedSolrServer. When
> > the reduce task completes and the OutputFormat is closed,
> > SolrRecordWriter calls commit() and optimize() on the
> > EmbeddedSolrServer.
> >
> > The API provides facilities to specify an arbitrary existing solr.home
> > directory, from which the conf/ and lib/ files will be taken.
> >
> > This process results in the creation of as many partial Solr home
> > directories as there were reduce tasks. The output shards are placed in
> > the output directory on the default filesystem (e.g. HDFS). Such
> > part-NNNNN directories can be used to run N shard servers. Additionally,
> > users can specify the number of reduce tasks (in particular, a single
> > reduce task, in which case the output will consist of a single shard).
> >
> > An example application is provided that processes large CSV files using
> > this API. It uses custom CSV processing to avoid (de)serialization
> > overhead.
> >
> > This patch relies on hadoop-core-0.19.1.jar; the jar is attached to
> > this issue and should be placed in contrib/hadoop/lib.
> >
> > Note: the development of this patch was sponsored by an anonymous
> > contributor and approved for release under the Apache License.
>
> --
> This message was sent by Atlassian JIRA
> (v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
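The batch-then-commit flow that SolrRecordWriter performs can be sketched as below. This is a simplified stand-in, not the attached SolrRecordWriter.java: the `BatchingWriter` class, its field names, and the stub counters replacing the EmbeddedSolrServer calls are all illustrative, and only the accumulate/flush/commit shape comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RecordWriter batching flow: documents accumulate in a
// batch, the batch is submitted once it reaches a threshold, and
// close() submits any remainder and then "commits", mirroring the
// commit()/optimize() calls made when the reduce task finishes.
public class BatchingWriter {
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    int submitted = 0;          // stands in for docs sent via server.add(batch)
    boolean committed = false;  // stands in for commit() + optimize()

    public BatchingWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    public void write(String doc) {
        batch.add(doc);
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        submitted += batch.size();  // real code would submit to the server here
        batch.clear();
    }

    public void close() {
        flush();                    // submit the trailing partial batch
        committed = true;           // finalize the shard's index
    }
}
```

Batching amortizes per-add overhead, and deferring commit()/optimize() to close() means each reducer produces one fully merged shard.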