Thank you for the detailed explanation, Andrzej.

My plugin contains one language model (a configuration file) whose size
is 40 MB. Could you please suggest where the model file should be
put?
a) Put it into the nutch/conf dir, like the "regex-urlfilter.txt" file.
b) Put it into the plugin's JAR package.

On 1/17/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Scott Green wrote:
> Well, why do all resources need to be packed?

Because when you run Nutch on a Hadoop cluster, Hadoop requires that all
job resources be packed into a job JAR, which is then submitted to each
tasktracker as a part of the job. So, if you want to run in non-local
mode you have to build the nutch-xxx.job JAR ("ant job" target).

Apparently you are running in so-called "local" mode, where these issues
are quite muddy - but as soon as you try to execute it on a cluster your
method will stop working.


> The built result may look like:
>
> xxx-plugin
>  `--- conf
>  `--- web
>  `--- xxx-plugin.jar
>  `--- deps.jar
>  `--- plugin.xml

Again: in the "local" mode this may work, but these unpacked plugins are
not available for jobs executing on a Hadoop cluster.

>
>> Now, you may have tested your method and found that it does indeed work
>> - but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
>> add your build/ directory to the classpath, so that you can locally test
>> the latest versions of the code without creating the *.job file.
>> However, when you run your code on a Hadoop cluster your local build/
>> directory is no longer accessible, and your method will mysteriously
>> fail - or even worse, you may get a different version of a resource from
>> an older version of the build/ directory found on Hadoop tasktracker
>> nodes ...
>
> If you pack everything into jar(s), it is possible that the jar on a
> Hadoop tasktracker node is an old version, right?

No. The job jar is always up to date, because it is sent with every job.

But if you don't get the resources from this jar, and instead rely on
using java.io.File-s, you may pick some old cruft from the local build/
directory that you may have accidentally deployed to your tasktrackers ...
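
To illustrate the difference, here is a minimal sketch of reading a resource
through the class loader rather than through java.io.File, so that it is
resolved from whatever is on the classpath (including the job JAR). The class
name ClasspathResourceLoader and the resource name "language-model.bin" are
hypothetical and are not part of Nutch or Hadoop; only standard JDK calls are
used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: loads a resource from the classpath.
public class ClasspathResourceLoader {

    /** Reads the named classpath resource fully into memory. */
    public static byte[] load(String name) throws IOException {
        // Prefer the context class loader; fall back to this class's loader.
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClasspathResourceLoader.class.getClassLoader();
        }
        try (InputStream in = cl.getResourceAsStream(name)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + name);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) {
        // "language-model.bin" is an assumed name used only for illustration.
        String name = args.length > 0 ? args[0] : "language-model.bin";
        try {
            byte[] data = load(name);
            System.out.println("Loaded " + data.length + " bytes from " + name);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Reading a 40 MB file fully into memory as above is only for brevity; for a
large model it may be preferable to parse directly from the stream.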

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


