Scott Green wrote:
Thank you for the detailed explanation, Andrzej.

My plugin contains one language model (a configuration file) that is
40 MB in size. Could you please suggest where the model file should
be put?
a) put it into nutch/conf dir like "regex-urlfilter.txt" file
b) put it into plugin's jar package.

From a purely theoretical point of view, either way should work fine: the content of the conf/ dir is packed into the job jar too.

One comment though, and I hope I'm not confusing you too much ;) If the file is that large, AND you execute your jobs using jobtracker/tasktrackers, AND you run on Hadoop DFS, you may want to do exactly the opposite of what I advocated ;) I.e. keep this file in a well-known external location on DFS, where it's accessible to all tasks. You should also set its replication factor equal to the number of datanodes, and then load this file directly from DFS. Even then, you wouldn't read it with java.io.File, but with FileSystem.open(Path).
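
A minimal sketch of the DFS approach described above, assuming the Hadoop client libraries are on the classpath and that the model is line-oriented text (adapt the parsing if it is binary). The class name DfsModelLoader and both method names are made up for illustration; only Configuration, Path, FileSystem.open(Path) and FileSystem.setReplication(Path, short) are Hadoop API calls.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Nutch or Hadoop.
public class DfsModelLoader {

    /** Reads the model line by line through the FileSystem that owns modelPath. */
    public static List<String> loadLines(Configuration conf, Path modelPath)
            throws IOException {
        // Resolves to DFS for DFS paths, or to the local file system for file:// paths.
        FileSystem fs = modelPath.getFileSystem(conf);
        List<String> lines = new ArrayList<>();
        // FileSystem.open(Path) replaces java.io.File / FileInputStream here.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(modelPath), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /**
     * One-time setup step: asks the file system to keep `replicas` copies of
     * the file. The caller chooses the value, e.g. the number of datanodes.
     */
    public static boolean raiseReplication(Configuration conf, Path modelPath,
                                           short replicas) throws IOException {
        return modelPath.getFileSystem(conf).setReplication(modelPath, replicas);
    }
}
```

A task would call loadLines once during its setup, passing the job's Configuration and the agreed location of the model; that location is something you choose yourself, it is not prescribed by Nutch.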

The reason is that if you pack this file into your job JAR, the job jar will become very large (presumably this 40MB is already compressed?). The job jar needs to be copied to each tasktracker for each task, so you will experience a performance hit just because of the size of the job jar ... whereas if this file sits on DFS and is highly replicated, its content will always be available locally.
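
To put a rough number on that cost, here is a back-of-the-envelope sketch. The 40 MB figure comes from the question; the count of 200 jar copies is purely hypothetical, and the real number depends on how many tasks the job runs and how the cluster distributes the jar. The class and method names are invented for this illustration.

```java
// Hypothetical estimate of the extra data shipped when the model rides in the job jar.
public class JobJarTransferEstimate {

    /** Extra bytes moved if every jar copy carries the model along. */
    static long extraBytesShipped(long modelBytes, long jarCopies) {
        return Math.multiplyExact(modelBytes, jarCopies);
    }

    public static void main(String[] args) {
        long modelBytes = 40L * 1024 * 1024; // 40 MB model, as in the question
        long jarCopies = 200;                // assumption: 200 jar copies for one job
        long extra = extraBytesShipped(modelBytes, jarCopies);
        System.out.println(extra / (1024 * 1024) + " MB of extra jar traffic");
        // prints "8000 MB of extra jar traffic"
    }
}
```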

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

