DIH should be able to read data directly from HDFS for indexing
---------------------------------------------------------------

                 Key: SOLR-2096
                 URL: https://issues.apache.org/jira/browse/SOLR-2096
             Project: Solr
          Issue Type: New Feature
          Components: contrib - DataImportHandler
    Affects Versions: 1.4.1
            Reporter: Amit Nithian
             Fix For: 1.4.2
         Attachments: hdfs_reader.tar

DIH doesn't support reading from the hdfs:// protocol, which makes it hard to
index data generated by an M/R job. The attached tarball contains a subclass
of URLDataSource along with an HDFSReader that together add this support. The
data is assumed to be in text format and processable by the
LineEntityProcessor.
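
To make the "text format" assumption concrete, here is a minimal sketch of
the row shape the LineEntityProcessor works with: one row per input line,
with the text stored under the "rawLine" column. This is not code from
hdfs_reader.tar; the class name LineRowsSketch, the readRows helper, and the
tab-separated sample lines are all assumptions made for illustration, and a
StringReader stands in for whatever Reader the data source would supply.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not code from hdfs_reader.tar or from Solr.
class LineRowsSketch {

    // Turns each line of the reader into a one-column row keyed by "rawLine",
    // the column name LineEntityProcessor uses for the text it reads.
    static List<Map<String, Object>> readRows(Reader in) throws IOException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                Map<String, Object> row = new HashMap<>();
                row.put("rawLine", line);
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: two tab-separated lines such as a text part file might hold.
        Reader sample = new StringReader("solr\t42\nlucene\t17\n");
        for (Map<String, Object> row : readRows(sample)) {
            System.out.println(row.get("rawLine"));
        }
    }
}
```

Any splitting of a line into separate fields would happen afterwards, through
the field mappings on the entity.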

Here is an example DIH-Config snippet:
<dataConfig>
  <dataSource name="queryData"
              type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
              baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  <document name="autoSuggester">
    <entity name="jc" processor="LineEntityProcessor"
            url="<YOUR FOLDER>/part*" dataSource="queryData">
      <!-- Field mappings here if necessary -->
    </entity>
  </document>
</dataConfig>

