[ 
https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135803#comment-17135803
 ] 

Patrick Mézard commented on NUTCH-2792:
---------------------------------------

In [3], I would prefix with the indexwriter *identifier*. To override the 
default csv indexwriter outpath, I would write:
{code:java}
-param indexer_csv_1.outpath=/some/path {code}
But what is the difference between dynamic parameters and other ones? You 
suggested reusing the properties syntax here, so why not reuse properties 
entirely?
{code:java}
-Dindexwriters.indexer_csv_1.outpath=/some/path {code}
 

The behaviour would be roughly:
 * You can override any indexwriters property that way from command line
 * You can define them in nutch-site.xml if you wish (but there is no strong 
reason to advertise this imho).
 * Properties in index-writers.xml are implicitly mapped to 
"-Dindexwriters.$writer_id.$property"

The only thing I am not completely happy with is the overriding order. My gut 
feeling would have been:
 * "Command-line property" > index-writers.xml > nutch-site.xml

But I suspect the properties are not handled by the command itself but by 
Hadoop via Tool, or something else. So we cannot tell "Command-line property" 
from "nutch-site.xml", and the behaviour would be:
 * "Command-line property" > nutch-site.xml > index-writers.xml

Probably not a big deal in practice, just a little weird since nutch-site.xml 
defines the location of index-writers.xml, and hence feels more *global*.

The implementation does not look too crazy either. At the end of 
"IndexWriters.loadWritersConfiguration", just iterate over the "indexwriters.*" 
keys from the global configuration. For each key:
 * Extract the writer id prefix. If it is not in the IndexWriterConfig, fail 
(or at least log an error).
 * Add all Configuration keys for this writer in IndexWriterConfig, overwriting 
existing ones.
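
A rough sketch of these two steps, to make the idea concrete. It is not Nutch code: plain maps stand in for the Hadoop Configuration and for the per-writer parameters, the class and method names are hypothetical, and it assumes writer ids contain no dot (property names may).

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch. globalConf stands in for the
// Hadoop Configuration, writers for the parameters loaded from
// index-writers.xml, keyed by writer id.
final class IndexWriterOverrides {

  static final String PREFIX = "indexwriters.";

  /**
   * Copies every "indexwriters.<writerId>.<property>" entry of globalConf
   * into the parameters of the matching writer, overwriting existing values.
   * Fails on a malformed key or an unknown writer id.
   */
  static void apply(Map<String, String> globalConf,
                    Map<String, Map<String, String>> writers) {
    for (Map.Entry<String, String> entry : globalConf.entrySet()) {
      String key = entry.getKey();
      if (!key.startsWith(PREFIX)) {
        continue; // not an index writer override
      }
      String rest = key.substring(PREFIX.length());
      // Assumption: the writer id ends at the first dot.
      int dot = rest.indexOf('.');
      if (dot <= 0 || dot == rest.length() - 1) {
        throw new IllegalArgumentException("Malformed override key: " + key);
      }
      String writerId = rest.substring(0, dot);
      String property = rest.substring(dot + 1);

      Map<String, String> params = writers.get(writerId);
      if (params == null) {
        throw new IllegalArgumentException(
            "Unknown index writer id '" + writerId + "' in key: " + key);
      }
      params.put(property, entry.getValue());
    }
  }
}
```

With "indexwriters.indexer_csv_1.outpath=/some/path" in the global configuration, the "outpath" parameter of writer "indexer_csv_1" would be replaced and its other parameters left alone. Throwing could be swapped for logging an error if failing hard is too strict.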

Once this is done, deprecate "nutch index -params".

What do you think?

> nutch index -params is only used in Solr indexer
> ------------------------------------------------
>
>                 Key: NUTCH-2792
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2792
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Minor
>             Fix For: 1.18
>
>
> `nutch index` help displays:
> {code:java}
>  General options:
> ...
>  -params k1=v1&k2=v2... parameters passed to indexer plugins
>  (via property indexer.additional.params){code}
> The option does nothing when used with CSV or dummy indexers. Looking at the 
> code, the property is defined in:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L78]
> which is only used in:
> [https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java#L141]
> Several possibilities:
>  * Drop the parameter from the help. Does not break backward compatibility.
>  * Move the -params handling in IndexWriters.java and add them to 
> IndexWriterParams of every indexer. Not too impactful but not super clean 
> either: the parameters are not "namespaced" per indexer, if someone uses 
> multiple indexers there may be parameter collisions.
>  * Refactor the way these parameters are passed, to prefix them with target 
> indexer. Would break backward compatibility. In that case, it would be good 
> to change the format completely: turn -params into -param, allow multiple 
> values to be passed and forget the '=/&' syntax (which does not handle 
> escaping anyway).
> Not sure how much this parameter is used. I would have used it to configure 
> the output path for indexer-csv or indexer-dummy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
