Hi Paul,

> the indexer was writing the
> documents info in the file (nutch.csv) twice,

Yes, I see. And now I know what I overlooked:

 .../bin/nutch index -Dmapreduce.job.reduces=2

You need to run the CSV indexer with only a single reducer.
In order to do so, please pass the option
  --num-tasks 1
to the script bin/crawl.

Alternatively, you could change
  NUM_TASKS=2
in bin/crawl to
  NUM_TASKS=1
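
For example, a crawl call could look like this (the seed and crawl
directories are placeholders, and the exact arguments of bin/crawl may
differ between Nutch versions, so please check its usage message):

  # one crawl round with indexing, forcing a single reduce task
  bin/crawl --index --num-tasks 1 -s urls/ crawl/ 1

With --num-tasks 1 the indexer job is launched with
-Dmapreduce.job.reduces=1, so only one reducer writes nutch.csv.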

This is related to the reason why you currently can't run the CSV
indexer in (pseudo-)distributed mode; see my previous note:

> A final note: the CSV indexer only works in local mode, it does not yet
> work in distributed mode (on a real Hadoop cluster). It was initially
> thought for debugging, not for larger production set up.

The issue is described here:
  https://issues.apache.org/jira/browse/NUTCH-2793

It's a tough one because a solution requires a change to the IndexWriter
interface. Index writers are plugins and do not know from which reducer
task they are run, nor to which path on a distributed or parallelized
system they have to write. On Hadoop, writing the output is done in two
steps: write to a local file and then "commit" the output to the final
location on the distributed file system.
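
Just to illustrate the two-step pattern: below is a rough, generic
MapReduce sketch (not the actual Nutch IndexWriter or CSV indexer code;
class and file names are made up) of what a reducer-side writer would
have to do to be safe in distributed mode: write a per-task file below
the task-attempt work path and leave the final move to the
OutputCommitter.

  // Generic MapReduce sketch of the two-step output (not Nutch code):
  // each reduce task writes below its own task-attempt work path, and
  // the OutputCommitter later "commits" (moves) that output to the
  // final directory on the distributed file system.
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class CsvSketchReducer
      extends Reducer<Text, Text, NullWritable, NullWritable> {

    private FSDataOutputStream csvOut;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // Per-task work path, e.g. <output>/_temporary/<attempt>/ --
      // unique per task, so two reducers never write the same file.
      Path workPath = FileOutputFormat.getWorkOutputPath(context);
      int partition = context.getTaskAttemptID().getTaskID().getId();
      Path csvFile = new Path(workPath,
          String.format("nutch-%05d.csv", partition));
      FileSystem fs = csvFile.getFileSystem(context.getConfiguration());
      csvOut = fs.create(csvFile);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        byte[] line = (key + "," + value + "\n")
            .getBytes(StandardCharsets.UTF_8);
        csvOut.write(line);
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      csvOut.close();
      // The commit step happens outside the reducer: the OutputCommitter
      // moves the task's files from the work path to the job output dir.
    }
  }

The IndexWriter plugins currently have no access to this task context,
which is exactly the missing piece described in NUTCH-2793.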

But yes, I should have another look at this issue, which has been stalled
for quite some time, especially now that it's clear that you might run
into issues even in local mode.

Thanks for reporting this! If you can, please also comment on the Jira
issue!

Best,
Sebastian


