Hi all,
The following applies to Nutch 1.8 (and at least 1.7 as well, it seems).
I've noticed that Nutch throws an exception when the elastic.cluster property
is not set, even when elastic.host and elastic.port are properly configured.
The documentation for the elastic properties says that you can either specify
elastic.cluster, or specify elastic.host together with elastic.port. However,
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.setConf() throws the
exception whenever elastic.cluster is missing, regardless of whether
elastic.host and elastic.port have been set.
Is this a known bug, and has it been fixed in the trunk? I was able to get the
Elasticsearch indexer working properly by setting elastic.host and
elastic.port, and commenting out the if-statement beginning on line 254 in
ElasticIndexWriter.java.
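To make the expected behaviour concrete, here is a minimal sketch of the check I would expect from the documentation: accept the configuration if either elastic.cluster is set, or both elastic.host and elastic.port are set. This is not the actual Nutch source. The class ElasticConfCheck and its methods are hypothetical names, and a plain Map stands in for the Hadoop Configuration object so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the Nutch code base.
final class ElasticConfCheck {

  // True when the property value is present and not blank.
  static boolean isSet(String value) {
    return value != null && !value.trim().isEmpty();
  }

  // Accepts either elastic.cluster alone, or elastic.host together with
  // elastic.port. Throws only when neither alternative is configured.
  static void validate(Map<String, String> conf) {
    boolean hasCluster = isSet(conf.get("elastic.cluster"));
    boolean hasHostAndPort =
        isSet(conf.get("elastic.host")) && isSet(conf.get("elastic.port"));
    if (!hasCluster && !hasHostAndPort) {
      throw new RuntimeException(
          "Missing elastic.cluster, or elastic.host and elastic.port. "
              + "Should be set in nutch-site.xml");
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("elastic.host", "localhost");
    conf.put("elastic.port", "9300");
    validate(conf); // does not throw, although elastic.cluster is absent
    System.out.println("configuration accepted");
  }
}
```

With a check along these lines, the host/port configuration shown further down would be accepted without having to comment anything out.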
For reference, here are the exception and the relevant properties from my
nutch-site.xml.
***Exception***
Indexer: java.lang.RuntimeException: Missing elastic.cluster. Should be set in nutch-site.xml
ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port
    elastic.index : elastic index command
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
    at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.setConf(ElasticIndexWriter.java:258)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
    at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
***nutch-site.xml***
<property>
<name>elastic.host</name>
<value>localhost</value>
<description>The hostname to send documents to using TransportClient.
Either host and port must be defined or cluster.</description>
</property>
<property>
<name>elastic.port</name>
<value>9300</value>
<description>The port to connect to using TransportClient.</description>
</property>
Cheers
Jake