Hi Sebastian,

Thanks for your tips. I switched on debugging for YARN and kept
"launch_container.sh" around for a few minutes so I could examine it. The
HADOOP and NUTCH CONF and HOME directories were set correctly for the AM as
well as for the MR YarnChild. CLASSPATH correctly includes the Nutch
configuration, so nutch-site.xml should be picked up. As I've realized, a
"job.xml" is attached to the submission from my remote computer, and it
includes every parameter set by the remote JVM through a HadoopConfiguration.
This means the only way to configure such a remote launch is to pass the
configuration parameters programmatically.

For example:
val hConf = new HadoopConfiguration()
hConf.set(..., ...)
hConf.set(..., ...)

val injection = new Injection(hConf)
injection.inject(...)

The above is just pseudocode, so sorry if there are any mistakes.
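
To avoid typing every parameter by hand, the name/value pairs could be read from a local nutch-site.xml first. Below is a minimal sketch of that idea, using only the JDK's XML parser. SiteProperties and parse are hypothetical names made up for illustration, and the sketch assumes the usual Hadoop-style layout of <property> elements with <name> and <value> children.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper, not part of Nutch or Hadoop.
object SiteProperties {
  /** Extracts name -> value pairs from Hadoop-style configuration XML. */
  def parse(xml: String): Map[String, String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(
      new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength).flatMap { i =>
      val prop = props.item(i).asInstanceOf[Element]
      // Text of the first child element with the given tag, if present.
      def text(tag: String): Option[String] = {
        val nodes = prop.getElementsByTagName(tag)
        if (nodes.getLength > 0) Some(nodes.item(0).getTextContent.trim)
        else None
      }
      // Properties missing either a name or a value are skipped.
      for (name <- text("name"); value <- text("value")) yield name -> value
    }.toMap
  }
}
```

Each resulting pair could then be passed to hConf.set(name, value) before the job is submitted, so that it ends up in the attached "job.xml".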

Cheers,
Zoltán
On 2017-07-19 17:43:13, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
Hi Zoltán,

a warning up front: personally, I've never tried to launch Nutch remotely,
so I don't know a solution.

If the property "plugin.folders" is not known, this means Nutch
also didn't read nutch-default.xml, where it is defined. I would start
by checking whether the classpath contains the configuration
folder (local mode) or the apache-nutch-*.job file (distributed mode).
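
A quick way to run that check from inside the JVM is to ask the class loader where it finds the configuration files. This is only a minimal sketch using the standard class-loader API. ClasspathCheck, locate and report are hypothetical names, not part of Nutch or Hadoop.

```scala
// Hypothetical helper for checking resource visibility on the classpath.
object ClasspathCheck {
  /** Returns the location of a resource as seen by the current class loader. */
  def locate(name: String): Option[java.net.URL] = {
    // Prefer the thread context class loader, fall back to the system one.
    val loader = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(ClassLoader.getSystemClassLoader)
    Option(loader.getResource(name))
  }

  /** Builds one human-readable line per resource name. */
  def report(names: Seq[String]): Seq[String] =
    names.map { name =>
      locate(name) match {
        case Some(url) => s"$name -> $url"
        case None      => s"$name -> NOT FOUND on classpath"
      }
    }
}
```

For example, ClasspathCheck.report(Seq("nutch-default.xml", "nutch-site.xml")).foreach(println) could be run on the submitting side, and the same check logged from within a task, to see where the two environments differ.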

Note that the environment variable NUTCH_CONF_DIR is used only by
bin/nutch - the path is added to the classpath. Loading of configuration
files (nutch-site.xml and nutch-default.xml) is delegated to Hadoop.
Similarly, NUTCH_HOME is only used to find the Nutch installation or
the job file.

To analyze the problem, try changing
log4j.logger.org.apache.hadoop=WARN
to INFO or DEBUG.

Best,
Sebastian

On 07/18/2017 08:50 PM, Zoltán Zvara wrote:
> Dear Community,
>
> I'm running the Inject job programmatically from within IntelliJ, where the
> target cluster's (YARN) configuration and the Nutch configuration are on the
> classpath. In addition, the HADOOP and NUTCH CONF and HOME directories
> are set to distributions that I have on my local machine.
>
> Starting the program, the Nutch Inject connects to YARN 2.8.0 and the inject 
> job starts correctly. However, during the initialization (setup) phase of the 
> mapper (InjectMapper), an exception is thrown:
>
> Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
> at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
> at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
> at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
>
> On the YARN NodeManagers, a Nutch distribution is installed with a
> configuration (nutch-site.xml) whose "plugin.folders" key points to
> the plugin folders by an absolute path. As for YARN, I've set up additional
> environment variables for the NMs, as follows:
>
> <property>
>   <name>yarn.nodemanager.admin-env</name>
>   <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf/,NUTCH_HOME=/opt/apache-nutch-1.13/</value>
> </property>
>
> In addition to this, I have set MR environment variables as well:
>
> <property>
>   <name>mapred.child.env</name>
>   <value>NUTCH_HOME=/opt/apache-nutch-1.13,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf</value>
> </property>
>
> I've also tried running the program with a JVM parameter, supplied with -D,
> to define "plugin.folders".
>
> Probably I'm missing something. How should I define "plugin.folders" when
> the inject job is submitted and run remotely?
>
> Thanks for helping me out.
>
> Zoltán
>
