On Sat, Nov 16, 2013 at 4:23 AM, Rafał Kuć (JIRA) <j...@apache.org> wrote:
> [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824426#comment-13824426 ]
>
> Rafał Kuć commented on SOLR-1301:
> ---------------------------------
>
> Mark, is the version attached to this issue the newest patch, or do you
> have something newer?
>
> > Add a Solr contrib that allows for building Solr indexes via Hadoop's
> > Map-Reduce.
> > ---------------------------------------------------------------------
> >
> >          Key: SOLR-1301
> >          URL: https://issues.apache.org/jira/browse/SOLR-1301
> >      Project: Solr
> >   Issue Type: New Feature
> >     Reporter: Andrzej Bialecki
> >     Assignee: Mark Miller
> >      Fix For: 4.6
> >
> >  Attachments: README.txt, SOLR-1301-hadoop-0-20.patch,
> > SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> > SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java,
> > commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
> > hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar,
> > hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar
> >
> > This patch contains a contrib module that provides distributed indexing
> > (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module
> > is twofold:
> >
> > * provide an API that is familiar to Hadoop developers, i.e. that of
> >   OutputFormat
> > * avoid unnecessary export and (de)serialization of data maintained on
> >   HDFS. SolrOutputFormat consumes data produced by reduce tasks
> >   directly, without storing it in intermediate files. Furthermore, by
> >   using an EmbeddedSolrServer, the indexing task is split into as many
> >   parts as there are reducers, and the data to be indexed is not sent
> >   over the network.
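The "split into as many parts as there are reducers" behavior follows Hadoop's usual key partitioning: each key is hashed to one of N reducers, and each reducer builds its own shard locally. A minimal sketch of that mapping (the `ShardPartitioner` class is illustrative, mirroring Hadoop's default HashPartitioner rather than any class from this patch):

```java
public class ShardPartitioner {
    // Mirrors Hadoop's default HashPartitioner: every document id maps to
    // exactly one of numReducers shards, so each reducer indexes a
    // disjoint slice of the data and no documents cross the network.
    public static int shardFor(String docId, int numReducers) {
        // Mask off the sign bit so negative hashCodes still yield a
        // non-negative shard index.
        return (docId.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int shards = 4;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the mapping is deterministic, re-running a job with the same reducer count sends each document to the same shard.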
> > Design
> > ------
> >
> > Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> > which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> > instantiates an EmbeddedSolrServer, along with an implementation of
> > SolrDocumentConverter, which is responsible for turning a Hadoop
> > (key, value) pair into a SolrInputDocument. This data is then added to a
> > batch, which is periodically submitted to the EmbeddedSolrServer. When
> > the reduce task completes and the OutputFormat is closed,
> > SolrRecordWriter calls commit() and optimize() on the
> > EmbeddedSolrServer.
> >
> > The API provides facilities to specify an arbitrary existing solr.home
> > directory, from which the conf/ and lib/ files will be taken.
> >
> > This process results in the creation of as many partial Solr home
> > directories as there were reduce tasks. The output shards are placed in
> > the output directory on the default filesystem (e.g. HDFS). Such
> > part-NNNNN directories can be used to run N shard servers. Additionally,
> > users can specify the number of reduce tasks (in particular, a single
> > reduce task, in which case the output will consist of a single shard).
> >
> > An example application is provided that processes large CSV files using
> > this API. It uses custom CSV processing to avoid (de)serialization
> > overhead.
> >
> > This patch relies on hadoop-core-0.19.1.jar; the jar is attached to
> > this issue and should be placed in contrib/hadoop/lib.
> >
> > Note: the development of this patch was sponsored by an anonymous
> > contributor and approved for release under the Apache License.
>
> --
> This message was sent by Atlassian JIRA
> (v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
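The batch-then-commit flow that SolrRecordWriter performs can be sketched as below. This is a simplified stand-in, not the attached SolrRecordWriter.java: the `BatchingWriter` class, its field names, and the stub counters replacing the EmbeddedSolrServer calls are all illustrative, and only the accumulate/flush/commit shape comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RecordWriter batching flow: documents accumulate in a
// batch, the batch is submitted once it reaches a threshold, and
// close() submits any remainder and then "commits", mirroring the
// commit()/optimize() calls made when the reduce task finishes.
public class BatchingWriter {
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    int submitted = 0;          // stands in for docs sent via server.add(batch)
    boolean committed = false;  // stands in for commit() + optimize()

    public BatchingWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    public void write(String doc) {
        batch.add(doc);
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        submitted += batch.size();  // real code would submit to the server here
        batch.clear();
    }

    public void close() {
        flush();                    // submit the trailing partial batch
        committed = true;           // finalize the shard's index
    }
}
```

Batching amortizes per-add overhead, and deferring commit()/optimize() to close() means each reducer produces one fully merged shard.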