Re: CSV indexer file data overwriting

Markus Jelsma Fri, 25 Nov 2022 05:04:18 -0800

Hi Paul, the account has been created. You should receive an email from
Jira in your inbox or spam box.


Thanks,
Markus

On Fri, 25 Nov 2022 at 14:01, Paul Escobar <[email protected]> wrote:

> Hello Markus,
>
> I'm very comfortable with your proposal; open source projects should
> welcome any contribution, however small and in whatever form.
>
> Best,
>
> On Fri, 25 Nov 2022 at 7:21, Markus Jelsma (<[email protected]>) wrote:
>
> > Hello Paul,
> >
> > > I tried to comment on this jira issue, but I don't have access,
> > > unfortunately I don't know how to do it.
> >
> > Because of too much spam, it is no longer possible to create a Jira
> > account yourself, but we can create one for you if you wish.
> >
> > Regards,
> > Markus
> >
> > On Thu, 24 Nov 2022 at 22:46, Paul Escobar <[email protected]> wrote:
> >
> > > Hello Sebastian,
> > >
> > > I got it: the CSV indexer needs a single task to run properly. I tested
> > > it and it worked. Thank you for the advice.
> > >
> > > I tried to comment on this jira issue, but I don't have access,
> > > unfortunately I don't know how to do it.
> > >
> > > I think if a committer changed CSVIndexWriter.java:
> > >
> > > if (fs.exists(csvLocalOutFile)) {
> > >    // clean-up
> > >    LOG.warn("Removing existing output path {}", csvLocalOutFile);
> > >    fs.delete(csvLocalOutFile, true);
> > > }
> > >
> > > to append data instead of deleting and recreating the file, the issue
> > > would be fixed, at least in local mode.
> > >
> > > Thanks again,
> > >
> > >
> > > On Thu, 24 Nov 2022 at 7:38, Sebastian Nagel (<[email protected]>) wrote:
> > >
> > > > Hi Paul,
> > > >
> > > >  > the indexer was writing the
> > > >  > documents info in the file (nutch.csv) twice,
> > > >
> > > > Yes, I see. And now I know what I've overlooked:
> > > >
> > > >   .../bin/nutch index -Dmapreduce.job.reduces=2
> > > >
> > > > You need to run the CSV indexer with only a single reducer.
> > > > In order to do so, please pass the option
> > > >    --num-tasks 1
> > > > to the script bin/crawl.
> > > >
> > > > Alternatively, you could change
> > > >    NUM_TASKS=2
> > > > in bin/crawl to
> > > >    NUM_TASKS=1
> > > >
> > > > This is related to why, for now, you can't run the CSV indexer
> > > > in (pseudo-)distributed mode; see my previous note:
> > > >
> > > >  > A final note: the CSV indexer only works in local mode, it does not
> > > >  > yet work in distributed mode (on a real Hadoop cluster). It was
> > > >  > initially intended for debugging, not for larger production setups.
> > > >
> > > > The issue is described here:
> > > >    https://issues.apache.org/jira/browse/NUTCH-2793
> > > >
> > > > It's a tough one because a solution requires a change of the
> > > > IndexWriter interface. Index writers are plugins and do not know from
> > > > which reducer task they are run and to which path on a distributed or
> > > > parallelized system they have to write. On Hadoop, writing the output
> > > > is done in two steps: write to a local file and then "commit" the
> > > > output to the final location on the distributed file system.
> > > >
> > > > But yes, we should have another look at this issue, which has been
> > > > stalled for quite some time, also because it's now clear that you
> > > > might run into issues even in local mode.
> > > >
> > > > Thanks for reporting the issue! If you can, please also comment on
> > > > the Jira issue!
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Paul Escobar Mossos
> > > skype: paulescom
> > > telefono: +57 1 3006815404
> > >
> >
>
>
> --
> Paul Escobar Mossos
> skype: paulescom
> telefono: +57 1 3006815404
>
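The change Paul suggests above (append to the existing CSV file rather than
delete and recreate it) can be illustrated with a minimal Java sketch. It uses
plain java.nio on a local temporary directory instead of the Hadoop FileSystem
calls in the quoted snippet, and the class and method names (CsvOutputSketch,
openByReplacing, openByAppending) are hypothetical, not Nutch code. It only
shows how the two strategies differ when two tasks write one after the other;
it is not a proposed patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical stand-in for a CSV writer; this is NOT Nutch's CSVIndexWriter.
class CsvOutputSketch {

  // Mirrors the quoted snippet: remove any existing file, then start from scratch.
  static void openByReplacing(Path out, List<String> rows) throws IOException {
    Files.deleteIfExists(out);
    Files.write(out, rows, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
  }

  // The suggested alternative: keep an existing file and append to it.
  static void openByAppending(Path out, List<String> rows) throws IOException {
    Files.write(out, rows, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("csv-sketch");
    Path replaced = dir.resolve("replaced.csv");
    Path appended = dir.resolve("appended.csv");

    // Two "tasks" write to the same path, one after the other.
    openByReplacing(replaced, List.of("doc1", "doc2"));
    openByReplacing(replaced, List.of("doc3"));
    openByAppending(appended, List.of("doc1", "doc2"));
    openByAppending(appended, List.of("doc3"));

    System.out.println(Files.readAllLines(replaced)); // [doc3]
    System.out.println(Files.readAllLines(appended)); // [doc1, doc2, doc3]
  }
}
```

Appending keeps the rows of both tasks in this sequential case. A real fix
would presumably still have to deal with any header line and with tasks that
write at the same time.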

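Sebastian's description of the two-step output (write to a local file, then
"commit" it to the final location) can be sketched in the same way. The
example below is only an illustration on the local file system: it assumes
each task is handed its own task number, which according to the thread is
exactly what index writer plugins do not know today. The class, method and
file names (TwoStepCommitSketch, writeTemp, commit, part-00000.csv) are
hypothetical and do not reflect Hadoop's actual committer API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Locale;

// Illustration of "write locally, then commit"; names are hypothetical.
class TwoStepCommitSketch {

  // Step 1: each task writes to a temporary file that only it uses.
  static Path writeTemp(Path workDir, int taskId, List<String> rows) throws IOException {
    Path tmp = workDir.resolve("_tmp-part-" + taskId + ".csv");
    Files.write(tmp, rows, StandardCharsets.UTF_8);
    return tmp;
  }

  // Step 2: "commit" by moving the finished file to a per-task final name,
  // so that two tasks never write to the same output path.
  static Path commit(Path tmp, Path outDir, int taskId) throws IOException {
    Files.createDirectories(outDir);
    Path finalFile = outDir.resolve(String.format(Locale.ROOT, "part-%05d.csv", taskId));
    return Files.move(tmp, finalFile, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path workDir = Files.createTempDirectory("csv-work");
    Path outDir = Files.createTempDirectory("csv-out");

    Path first = commit(writeTemp(workDir, 0, List.of("doc1", "doc2")), outDir, 0);
    Path second = commit(writeTemp(workDir, 1, List.of("doc3")), outDir, 1);

    System.out.println(first.getFileName() + " " + Files.readAllLines(first));
    System.out.println(second.getFileName() + " " + Files.readAllLines(second));
  }
}
```

With one output file per task, nothing is overwritten, at the cost of ending
up with several CSV files instead of a single nutch.csv.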