Hello Markus, I'm very comfortable with your proposal; open source projects should take advantage of every contribution, no matter how small.
Best,

On Fri, Nov 25, 2022 at 7:21, Markus Jelsma (<markus.jel...@openindex.io>) wrote:
> Hello Paul,
>
> > I tried to comment on this jira issue, but I don't have access,
> > unfortunately I don't know how to do it.
>
> Due to too much spam, it is no longer possible to create an account for
> yourself, but we can do that for you if you wish.
>
> Regards,
> Markus
>
> On Thu, Nov 24, 2022 at 22:46, Paul Escobar <paul.escobar.mos...@gmail.com> wrote:
>
> > Hello Sebastian,
> >
> > I got it: the CSV indexer needs a single task to run properly. I tested
> > it and it worked. Thank you for the advice.
> >
> > I tried to comment on this Jira issue, but I don't have access;
> > unfortunately I don't know how to do it.
> >
> > I think if a committer changed CSVIndexerWriter.java:
> >
> > if (fs.exists(csvLocalOutFile)) {
> >   // clean-up
> >   LOG.warn("Removing existing output path {}", csvLocalOutFile);
> >   fs.delete(csvLocalOutFile, true);
> > }
> >
> > to append data instead of deleting and recreating the file, the issue
> > would be fixed, at least in local mode.
> >
> > Thanks again,
> >
> > On Thu, Nov 24, 2022 at 7:38, Sebastian Nagel (<wastl.na...@googlemail.com>) wrote:
> >
> > > Hi Paul,
> > >
> > > > the indexer was writing the
> > > > documents info in the file (nutch.csv) twice,
> > >
> > > Yes, I see. And now I know what I've overlooked:
> > >
> > >   .../bin/nutch index -Dmapreduce.job.reduces=2
> > >
> > > You need to run the CSV indexer with only a single reducer.
> > > In order to do so, please pass the option
> > >   --num-tasks 1
> > > to the script bin/crawl.
> > >
> > > Alternatively, you could change
> > >   NUM_TASKS=2
> > > in bin/crawl to
> > >   NUM_TASKS=1
> > >
> > > This is related to why, as of now, you can't run the CSV indexer
> > > in (pseudo-)distributed mode; see my previous note:
> > >
> > > > A final note: the CSV indexer only works in local mode, it does not
> > > > yet work in distributed mode (on a real Hadoop cluster).
> > > > It was initially intended for debugging, not for larger production
> > > > setups.
> > >
> > > The issue is described here:
> > > https://issues.apache.org/jira/browse/NUTCH-2793
> > >
> > > It's a tough one because a solution requires a change to the IndexWriter
> > > interface. Index writers are plugins and do not know from which reducer
> > > task they are run or to which path on a distributed or parallelized
> > > system they have to write. On Hadoop, writing the output is done in two
> > > steps: write to a local file, then "commit" the output to its final
> > > location on the distributed file system.
> > >
> > > But yes, I should have another look at this issue, which has been
> > > stalled for quite some time. Also because it's now clear that you might
> > > run into issues even in local mode.
> > >
> > > Thanks for reporting the issue! If you can, please also comment on the
> > > Jira issue!
> > >
> > > Best,
> > > Sebastian
> >
> > --
> > Paul Escobar Mossos
> > skype: paulescom
> > phone: +57 1 3006815404

--
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
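[Editor's note: the append-instead-of-delete idea discussed in the thread can be illustrated with a minimal sketch using plain java.nio. This is not Nutch code; the class and method names (AppendSketch, writeRow) are hypothetical, and a real patch to CSVIndexerWriter would have to go through Hadoop's FileSystem and output-commit machinery, as Sebastian describes.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch: instead of deleting an existing output file
// (the fs.delete(csvLocalOutFile, true) branch quoted above), open the
// file in append mode so a second write adds rows rather than
// truncating the rows written earlier.
public class AppendSketch {

    // Writes one CSV row, creating the file if absent, appending otherwise.
    static void writeRow(Path csvOut, String row) throws IOException {
        Files.writeString(csvOut, row + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("nutch", ".csv");
        Files.delete(out); // start from a clean slate
        writeRow(out, "url1,title1");
        writeRow(out, "url2,title2"); // appends; does not truncate the first row
        List<String> lines = Files.readAllLines(out);
        System.out.println(lines.size()); // both rows survive
    }
}
```

Note that blind appending has its own hazard: re-running the indexer over the same segment would duplicate rows, which is why the Jira discussion points at the reducer/commit interface rather than a one-line change.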