Hi Paul,

The account has been created. You should receive an email from Jira in your inbox or spam box.
Thanks,
Markus

On Fri, 25 Nov 2022 at 14:01, Paul Escobar <paul.escobar.mos...@gmail.com> wrote:

> Hello Markus,
>
> I'm very comfortable with your proposal; open source projects must take
> advantage of any little contribution, no matter the way.
>
> Best,
>
> On Fri, 25 Nov 2022 at 7:21, Markus Jelsma (<markus.jel...@openindex.io>)
> wrote:
>
> > Hello Paul,
> >
> > > I tried to comment on this jira issue, but I don't have access,
> > > unfortunately I don't know how to do it.
> >
> > Due to too much spam, it is no longer possible to create an account for
> > yourself, but we can do that for you if you wish.
> >
> > Regards,
> > Markus
> >
> > On Thu, 24 Nov 2022 at 22:46, Paul Escobar
> > <paul.escobar.mos...@gmail.com> wrote:
> >
> > > Hello Sebastian,
> > >
> > > I got it: the CSV indexer needs one task to run properly. I tested it
> > > and it worked. Thank you for the advice.
> > >
> > > I tried to comment on this Jira issue, but I don't have access;
> > > unfortunately I don't know how to do it.
> > >
> > > I think if a committer changed CSVIndexerWriter.java:
> > >
> > >     if (fs.exists(csvLocalOutFile)) {
> > >       // clean-up
> > >       LOG.warn("Removing existing output path {}", csvLocalOutFile);
> > >       fs.delete(csvLocalOutFile, true);
> > >     }
> > >
> > > to try to append data instead of deleting and re-creating the file,
> > > the issue would be fixed in local mode, at least.
> > >
> > > Thanks again,
> > >
> > > On Thu, 24 Nov 2022 at 7:38, Sebastian Nagel
> > > (<wastl.na...@googlemail.com>) wrote:
> > >
> > > > Hi Paul,
> > > >
> > > > > the indexer was writing the
> > > > > documents info in the file (nutch.csv) twice,
> > > >
> > > > Yes, I see. And now I know what I've overlooked:
> > > >
> > > >     .../bin/nutch index -Dmapreduce.job.reduces=2
> > > >
> > > > You need to run the CSV indexer with only a single reducer.
> > > > In order to do so, please pass the option
> > > >     --num-tasks 1
> > > > to the script bin/crawl.
> > > >
> > > > Alternatively, you could change
> > > >     NUM_TASKS=2
> > > > in bin/crawl to
> > > >     NUM_TASKS=1
> > > >
> > > > This is related to why, for now, you can't run the CSV indexer
> > > > in (pseudo)distributed mode, see my previous note:
> > > >
> > > > > A final note: the CSV indexer only works in local mode, it does
> > > > > not yet work in distributed mode (on a real Hadoop cluster). It
> > > > > was initially intended for debugging, not for a larger production
> > > > > setup.
> > > >
> > > > The issue is described here:
> > > > https://issues.apache.org/jira/browse/NUTCH-2793
> > > >
> > > > It's a tough one because a solution requires a change of the
> > > > IndexWriter interface. Index writers are plugins and do not know
> > > > from which reducer task they are run and to which path on a
> > > > distributed or parallelized system they have to write. On Hadoop,
> > > > writing the output is done in two steps: write to a local file and
> > > > then "commit" the output to the final location on the distributed
> > > > file system.
> > > >
> > > > But yes, we should have another look at this issue, which has been
> > > > stalled for quite some time. Also because it's now clear that you
> > > > might run into issues even in local mode.
> > > >
> > > > Thanks for reporting the issue! If you can, please also comment on
> > > > the Jira issue!
> > > >
> > > > Best,
> > > > Sebastian
> > >
> > > --
> > > Paul Escobar Mossos
> > > skype: paulescom
> > > telefono: +57 1 3006815404
>
> --
> Paul Escobar Mossos
> skype: paulescom
> telefono: +57 1 3006815404
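
The idea raised in the thread, appending to an existing CSV file instead of deleting and re-creating it, can be illustrated outside of Nutch with plain java.nio. This is only a sketch under assumptions: CsvAppendSketch, writeReplacing and writeAppending are hypothetical names, and the real index writer goes through Hadoop's FileSystem API, which is not used here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical stand-in for the "open output file" step of a CSV writer.
// Not Nutch code; it only contrasts the two strategies discussed in the thread.
class CsvAppendSketch {

  // Mirrors the quoted clean-up: remove an existing file, then start fresh.
  static void writeReplacing(Path out, List<String> rows) throws IOException {
    Files.deleteIfExists(out);
    Files.write(out, rows, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
  }

  // Proposed alternative: keep what is already there and append the new rows.
  static void writeAppending(Path out, List<String> rows) throws IOException {
    Files.write(out, rows, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("csv-sketch");
    Path replaced = dir.resolve("replaced.csv");
    Path appended = dir.resolve("appended.csv");

    // Two "tasks" writing one after the other to the same file.
    writeReplacing(replaced, List.of("doc1"));
    writeReplacing(replaced, List.of("doc2"));
    writeAppending(appended, List.of("doc1"));
    writeAppending(appended, List.of("doc2"));

    // Replacing keeps only the second task's rows; appending keeps both.
    System.out.println(Files.readAllLines(replaced));
    System.out.println(Files.readAllLines(appended));
  }
}
```

Appending alone would not cover distributed mode (NUTCH-2793), where each reducer task writes separately, and a writer that emits a header row would presumably need to skip it when the file already exists.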