Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexWriters" page has been changed by RoannelFernandez:
https://wiki.apache.org/nutch/IndexWriters

Comment:
Parameters from index-writers.xml and a few changes

New page:
= Index writers configuration =

<<TableOfContents(4)>>

== Structure of index-writers.xml ==

== Mapping section ==

== Parameters section ==

=== Solr indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
||type ||Specifies the 
[[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/SolrClient.html|SolrClient]]
 implementation to use. Must be either '''cloud''' or '''http''', which select 
[[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html|CloudSolrServer]]
 or 
[[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html|HttpSolrServer]]
 respectively. ||http ||
||url ||Defines the fully qualified URL of the Solr instance into which data 
should be indexed. Multiple URLs can be provided, separated by commas. 
||http://localhost:8983/solr/nutch ||
||commitSize ||Defines the number of documents to send to Solr in a single 
update batch. Decrease it when handling very large documents to prevent Nutch 
from running out of memory.<<BR>> '''Note''': It does not explicitly trigger a 
server-side commit. ||250 ||
||auth || Whether to enable HTTP basic authentication for communicating with 
Solr. Use the [[#username|username]] and [[#password|password]] properties to 
configure your credentials. ||false ||
||<<Anchor(username)>> username ||The username for the Solr server. ||username ||
||<<Anchor(password)>> password ||The password for the Solr server. ||password ||
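
The batching behaviour described for commitSize can be illustrated with a 
minimal sketch. It is not the Nutch implementation: the `batches` helper and 
the sample documents are hypothetical, and only the default of 250 comes from 
the table above.

```python
from typing import Iterable, Iterator, List


def batches(docs: Iterable[dict], commit_size: int = 250) -> Iterator[List[dict]]:
    """Yield lists of at most commit_size documents, preserving order."""
    if commit_size < 1:
        raise ValueError("commit_size must be positive")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= commit_size:
            yield batch
            batch = []
    if batch:  # the final batch may be smaller than commit_size
        yield batch


if __name__ == "__main__":
    sample = [{"id": str(i)} for i in range(600)]  # hypothetical documents
    print([len(b) for b in batches(sample)])
```

A smaller commitSize means fewer documents are held in memory at once, which is 
why the table suggests lowering it for very large documents.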

=== Elasticsearch indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| host || Comma-separated list of hostnames to send documents to using 
TransportClient. Either host and port, or cluster, must be defined. ||  ||
|| port || The port to connect to using TransportClient. || 9300 ||
|| cluster || The cluster name to discover. Either host and port, or cluster, 
must be defined. ||  ||
|| index || Default index to send documents to. || nutch ||
|| max.bulk.docs || Maximum size of the bulk in number of documents. || 250 ||
|| max.bulk.size || Maximum size of the bulk in bytes. || 2500500 ||
|| exponential.backoff.millis || Initial delay for the BulkProcessor's 
exponential backoff policy. || 100 ||
|| exponential.backoff.retries || Number of times the BulkProcessor's 
exponential backoff policy should retry bulk operations. || 10 ||
|| bulk.close.timeout || Number of seconds allowed for the BulkProcessor to 
complete its last operation. || 600 ||
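
How max.bulk.docs and max.bulk.size interact can be sketched as follows, 
assuming a bulk is flushed as soon as either limit is reached. `BulkBuffer` is 
a hypothetical class written for illustration; it is not the Elasticsearch 
BulkProcessor API, and measuring size as the length of the JSON encoding is an 
assumption.

```python
import json
from typing import List

MAX_BULK_DOCS = 250       # default of max.bulk.docs in the table above
MAX_BULK_SIZE = 2500500   # default of max.bulk.size in the table above (bytes)


class BulkBuffer:
    """Hypothetical buffer that reports when a bulk request is due."""

    def __init__(self, max_docs: int = MAX_BULK_DOCS,
                 max_bytes: int = MAX_BULK_SIZE) -> None:
        self.max_docs = max_docs
        self.max_bytes = max_bytes
        self.docs: List[dict] = []
        self.size_bytes = 0

    def add(self, doc: dict) -> bool:
        """Buffer a document; return True once either limit is reached."""
        self.docs.append(doc)
        # Assumption: document size is approximated by its UTF-8 JSON length.
        self.size_bytes += len(json.dumps(doc).encode("utf-8"))
        return (len(self.docs) >= self.max_docs
                or self.size_bytes >= self.max_bytes)

    def flush(self) -> List[dict]:
        """Return the buffered documents and reset the buffer."""
        flushed, self.docs, self.size_bytes = self.docs, [], 0
        return flushed
```

A caller would invoke `flush()` whenever `add()` returns True, and once more at 
the end to send any remaining documents.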

=== Rabbit indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| server.uri || URI with connection parameters in the form 
amqp://username:password@hostname:port/virtualHost <<BR>> Where: 
<<Include(IndexWriters/RabbitURIParts)>> || amqp://guest:guest@localhost:5672/ 
||
|| binding || Whether the relationship between an exchange and a queue is 
created automatically. Default "false". <<BR>> '''NOTE:''' Binding between 
exchanges is not supported. || false ||
|| binding.arguments || Arguments used in binding. It must have the form 
key1=value1,key2=value2. This value is only used when the exchange's type is 
headers and the value of 'rabbitmq.indexer.binding' property is true. 
Otherwise it is ignored. ||  ||
|| exchange.name || Name for the exchange where the messages will be sent. 
Default "". ||  ||
|| exchange.options || Options used when the exchange is created. Only used 
when the value of 'rabbitmq.indexer.binding' property is true. Default 
"type=direct,durable=true". || type=direct,durable=true ||
|| queue.name || Name of the queue used to create the binding. Default 
"nutch.queue". Only used when the value of 'rabbitmq.indexer.binding' property 
is true. || nutch.queue ||
|| queue.options || Options used when the queue is created. Only used when the 
value of 'rabbitmq.indexer.binding' property is true. Default 
"durable=true,exclusive=false,auto-delete=false".<<BR>> It must have the form 
durable={durable},exclusive={exclusive},auto-delete={auto-delete},arguments={arguments}<<BR>>
 where: <<Include(IndexWriters/RabbitQueueOptions)>> || 
durable=true,exclusive=false,auto-delete=false ||
|| routingkey || The routing key used to publish messages to specific queues. 
It is only used when the exchange type is "topic" or "direct". Default is the 
value of 'rabbitmq.indexer.queue.name' property. ||  ||
|| commit.mode || "single" if a message contains only one document. In this 
case a header with the action (write, update or delete) will be added. 
"multiple" if a message contains all documents. Default "multiple". || multiple 
||
|| commit.size || Number of documents to send in each message if the value of 
'rabbitmq.indexer.commit.mode' property is "multiple". Default "250". || 250 ||
|| headers.static || Headers to add to each message. It must have the form 
key1=value1,key2=value2. ||  ||
|| headers.dynamic || Document's fields to add as headers to each message. It 
must have the form field1,field2. ||  ||
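
The value formats used in this table can be illustrated with a short sketch 
built on Python's standard library. Both helper functions are hypothetical and 
are not part of Nutch or of any RabbitMQ client. The sketch returns the URI 
path as-is rather than interpreting it as a virtual host, and it covers only 
the flat key1=value1,key2=value2 form, not the nested arguments part of 
queue.options.

```python
from typing import Dict
from urllib.parse import urlsplit


def parse_server_uri(uri: str) -> dict:
    """Split an amqp:// URI into its connection parts."""
    parts = urlsplit(uri)
    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        "path": parts.path,  # carries the virtual host segment
    }


def parse_key_values(text: str) -> Dict[str, str]:
    """Parse 'key1=value1,key2=value2' into a dict; '' gives {}."""
    result: Dict[str, str] = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # skip empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    print(parse_server_uri("amqp://guest:guest@localhost:5672/"))
    print(parse_key_values("type=direct,durable=true"))
```

Values stay strings here; converting "true" or "false" to booleans is left to 
the caller.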

=== Elasticsearch rest indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| host || A hostname, or a comma-separated list of hostnames, to send 
documents to using Elasticsearch Jest. Both host and port must be defined. ||  
||
|| port || The port to connect to using Elasticsearch Jest. || 9200 Check this 
number||
|| index || Default index to send documents to. || nutch ||
|| max.bulk.docs || Maximum size of the bulk in number of documents. || 250 ||
|| max.bulk.size || Maximum size of the bulk in bytes. || 2500500 Check this 
number||
|| user || Username for auth credentials (only used when https is enabled) || 
user ||
|| password || Password for auth credentials (only used when https is enabled) 
|| password ||
|| type || Default type to send documents to. || doc ||
|| https || "true" to enable https, "false" to disable it. If you've disabled 
http access (by forcing https), be sure to set this to true, otherwise you 
might get "connection reset by peer". || false ||
|| trustallhostnames || "true" to trust the Elasticsearch server's certificate 
even if its listed domain name does not match the domain it is hosted on. 
"false" to check that the certificate's listed domain matches the domain the 
server is hosted on, and to fail to index if it does not. Only used when https 
is enabled. || false ||
|| languages || A list of strings denoting the supported languages (e.g. 
`en,de,fr,it`). If this value is empty all documents will be sent to index 
${elastic.rest.index}. If not empty the Rest client will distribute documents 
in different indices based on their `lang` property. Indices are named with the 
following schema: ${elastic.rest.index}${elastic.rest.index.separator}${lang} 
(e.g. `nutch_de`). Entries with an unsupported `lang` value will be added to 
index 
${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink} 
(e.g. `nutch_others`). ||  ||
|| separator || Default value is `_`. Used only if 
`elastic.rest.index.languages` is defined, to build the index name (i.e. 
${elastic.rest.index}${elastic.rest.index.separator}${lang}).  || _ ||
|| sink || Default value is `others`. Used only if 
`elastic.rest.index.languages` is defined, to build the name of the index that 
stores documents with unsupported languages (i.e. 
${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink}).
 || others ||
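
The index naming rules described for languages, separator and sink can be 
sketched as a small function. This is an illustration of the rules as stated in 
the table, not the plugin's code; the function name `index_for` and its 
signature are hypothetical, and the defaults mirror the table.

```python
from typing import Optional, Sequence


def index_for(lang: Optional[str],
              index: str = "nutch",
              languages: Sequence[str] = (),
              separator: str = "_",
              sink: str = "others") -> str:
    """Return the index a document with the given `lang` would be sent to."""
    if not languages:
        # No languages configured: everything goes to the default index.
        return index
    if lang in languages:
        return f"{index}{separator}{lang}"
    # Unsupported or missing language: use the sink index.
    return f"{index}{separator}{sink}"


if __name__ == "__main__":
    supported = ["en", "de", "fr", "it"]
    print(index_for("de", languages=supported))
    print(index_for("ja", languages=supported))
    print(index_for("de"))
```

Treating a missing `lang` value the same as an unsupported one is an assumption 
of this sketch.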
