[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912393#comment-12912393 ]

Alexander Kanarsky edited comment on SOLR-1301 at 5/8/11 8:42 PM:
------------------------------------------------------------------

The latest 0.20 patch is repackaged to be placed under contrib, as it was 
initially (build.xml is included), and tested against the current trunk. As 
usual, after applying the patch put the 4 lib jars (hadoop, log4j, and the two 
commons-logging jars) into contrib/hadoop/lib. No unit tests for now :) but I 
hope to add some soon. Here is the big question: as Andrzej once mentioned, the 
unit tests require a running Hadoop cluster. One approach is to make the patch 
and unit tests work with the Hadoop mini-cluster (ClusterMapReduceTestCase); 
however, this brings in some extra dependencies needed to run the cluster (like 
Jetty). Another idea is to use "your own" cluster and just configure access to 
it in the unit tests; this approach seems logical, but it may give different 
test results on different clusters and may not provide the low-level access to 
the execution that the tests need. So what is your opinion on how the tests for 
solr-hadoop should be run? I am not really happy with the idea of starting and 
running a Hadoop cluster as part of the Solr unit tests, but that could still 
be better than no unit tests at all.
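
For illustration, here is a rough sketch of what such a mini-cluster test could 
look like. ClusterMapReduceTestCase is Hadoop's own JUnit base class (it starts 
a local DFS + MapReduce cluster in setUp()); the commented-out 
SolrOutputFormat/SolrDocumentConverter calls are only assumptions about how the 
patch would be wired in, not its final API:

import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.ClusterMapReduceTestCase;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical test against the Hadoop mini-cluster. Everything Solr-specific
// is commented out because the exact setup API comes from this patch.
public class TestSolrOutputFormat extends ClusterMapReduceTestCase {

  public void testIndexing() throws Exception {
    JobConf conf = createJobConf();      // JobConf bound to the mini MR cluster
    FileSystem fs = getFileSystem();     // mini DFS

    // Write a tiny CSV input file into the mini DFS.
    Path input = new Path("input");
    Path output = new Path("output");
    Writer w = new OutputStreamWriter(fs.create(new Path(input, "docs.csv")));
    w.write("1,hello\n2,world\n");
    w.close();

    conf.setJobName("solr-indexing-test");
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, input);
    FileOutputFormat.setOutputPath(conf, output);
    conf.setNumReduceTasks(1);           // single output shard

    // Assumed patch wiring (names are placeholders, not the final API):
    // conf.setOutputFormat(SolrOutputFormat.class);
    // SolrDocumentConverter.setSolrDocumentConverter(CsvDocumentConverter.class, conf);
    // SolrOutputFormat.setupSolrHomeCache(new File("example/solr"), conf);

    JobClient.runJob(conf);

    // A real test would open the generated shard and assert on its contents;
    // here we only check that the job produced output.
    assertTrue(fs.exists(output));
  }
}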

> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.2
>
>         Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, 
> commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, 
> hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar
>
>
> This patch contains a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold (a sketch of the intended job setup follows the list):
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
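>
> A rough sketch of the intended usage from the job driver's side is shown 
> below (this assumes the old "mapred" API that hadoop-core 0.19/0.20 provides; 
> how the SolrDocumentConverter implementation and solr.home are registered is 
> patch-specific and only hinted at in the comments):
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
> // import of SolrOutputFormat omitted; its package is defined by this contrib module
>
> public class CsvIndexingDriver {
>   public static void main(String[] args) throws Exception {
>     JobConf conf = new JobConf(CsvIndexingDriver.class);
>     conf.setJobName("csv-to-solr");
>
>     conf.setInputFormat(TextInputFormat.class);
>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>     // Each reducer feeds its own EmbeddedSolrServer, so the number of reduce
>     // tasks determines the number of output shards (1 => a single shard).
>     conf.setNumReduceTasks(4);
>
>     // Output format provided by this patch; registering the
>     // SolrDocumentConverter implementation and solr.home is patch-specific
>     // and omitted here.
>     conf.setOutputFormat(SolrOutputFormat.class);
>
>     JobClient.runJob(conf);
>   }
> }
>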
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> a Hadoop (key, value) pair into a SolrInputDocument. This data is then added 
> to a batch, which is periodically submitted to EmbeddedSolrServer. When the 
> reduce task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
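>
> For illustration, a converter might look like this (the SolrDocumentConverter 
> base type and the convert() signature shown here are assumptions based on the 
> description above, not necessarily the exact contract in the patch):
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.solr.common.SolrInputDocument;
>
> // Hypothetical converter from a CSV line ("id,title") to a SolrInputDocument.
> public class CsvDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {
>
>   @Override
>   public SolrInputDocument convert(LongWritable key, Text value) {
>     String[] fields = value.toString().split(",", 2);
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", fields[0]);
>     doc.addField("title", fields.length > 1 ? fields[1] : "");
>     return doc;
>   }
> }
>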
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.
