Scott Green wrote:
Thank you for the detailed explanation, Andrzej.
My plugin contains one language model (a configuration file) whose size
is 40 MB. Could you please suggest where the model file should be
put?
a) put it into the nutch/conf dir, like the "regex-urlfilter.txt" file
b) put it into the plugin's jar package.
From a purely theoretical point of view, either way should work fine,
since the content of the conf/ dir is packed into the job jar too.
One comment though, and I hope I'm not confusing you too much ;) If the
file is that large, AND you execute your jobs using
jobtracker/tasktrackers, AND you run on Hadoop DFS, you may want to do
exactly the opposite of what I advocated ;) I.e. keep this file in a
well-known external location on DFS, where it's accessible to all tasks.
You should also set its replication factor equal to the number of
datanodes, and then load this file directly from DFS. Still, you
wouldn't use java.io.File, but FileSystem.open(Path).
The reason is that if you pack this file into your job JAR, the job jar
would become very large (presumably this 40MB is already compressed?).
The job jar needs to be copied to each tasktracker for each task, so you
will experience a performance hit just because of the size of the job jar
... whereas if this file sits on DFS and is highly replicated, its
content will always be available locally.
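
A minimal sketch of the DFS approach, assuming the Hadoop jars are on the
classpath. The class name ModelLoader, its method names and the replica
count are hypothetical and only illustrate the idea; the Hadoop calls used
are FileSystem.open(Path), FileSystem.getFileStatus(Path) and
FileSystem.setReplication(Path, short).

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: loads a large model file straight from DFS. */
public class ModelLoader {

  /** Reads the whole file at modelPath into memory via FileSystem.open(Path). */
  public static byte[] loadModel(FileSystem fs, Path modelPath) throws IOException {
    long len = fs.getFileStatus(modelPath).getLen();
    if (len > Integer.MAX_VALUE) {
      throw new IOException("Model file too large to buffer: " + modelPath);
    }
    byte[] data = new byte[(int) len];
    // Note: no java.io.File here, the stream comes from the (distributed) FileSystem.
    try (FSDataInputStream in = fs.open(modelPath)) {
      in.readFully(data);
    }
    return data;
  }

  /**
   * Asks the FileSystem to keep the given number of replicas of the file.
   * Assumption: the caller passes the number of datanodes as replicas.
   */
  public static boolean replicateWidely(FileSystem fs, Path modelPath, short replicas)
      throws IOException {
    return fs.setReplication(modelPath, replicas);
  }
}
```

Inside a plugin you would typically obtain the FileSystem with
FileSystem.get(conf), using the job's Configuration, and pass the
well-known location of the model on DFS as the Path.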
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com