Hello Markus, I'm very comfortable with your proposal; open source projects should take advantage of every contribution, no matter how small.
Best,

On Fri, Nov 25, 2022 at 7:21, Markus Jelsma (<markus.jel...@openindex.io>) wrote:
> Hello Paul,
>
> > I tried to comment on this jira issue, but I don't have access,
> > unfortunately I don't know how to do it.
>
> Due to too much spam, it is no longer possible to create an account for
> yourself, but we can do that for you if you wish.
>
> Regards,
> Markus
>
> On Thu, Nov 24, 2022 at 22:46, Paul Escobar <paul.escobar.mos...@gmail.com> wrote:
>
> > Hello Sebastian,
> >
> > I got it: the CSV indexer needs a single task to run properly. I tested
> > it and it worked. Thank you for the advice.
> >
> > I tried to comment on this Jira issue, but I don't have access;
> > unfortunately I don't know how to do it.
> >
> > I think if a committer changed CSVIndexerWriter.java:
> >
> > if (fs.exists(csvLocalOutFile)) {
> >   // clean-up
> >   LOG.warn("Removing existing output path {}", csvLocalOutFile);
> >   fs.delete(csvLocalOutFile, true);
> > }
> >
> > to append data instead of deleting and recreating the file, the issue
> > would be fixed, at least in local mode.
> >
> > Thanks again,
> >
> > On Thu, Nov 24, 2022 at 7:38, Sebastian Nagel (<wastl.na...@googlemail.com>) wrote:
> >
> > > Hi Paul,
> > >
> > > > the indexer was writing the
> > > > documents info in the file (nutch.csv) twice,
> > >
> > > Yes, I see. And now I know what I've overlooked:
> > >
> > >   .../bin/nutch index -Dmapreduce.job.reduces=2
> > >
> > > You need to run the CSV indexer with only a single reducer.
> > > In order to do so, please pass the option
> > >   --num-tasks 1
> > > to the script bin/crawl.
> > >
> > > Alternatively, you could change
> > >   NUM_TASKS=2
> > > in bin/crawl to
> > >   NUM_TASKS=1
> > >
> > > This is related to why, as of now, you can't run the CSV indexer
> > > in (pseudo-)distributed mode; see my previous note:
> > >
> > > > A final note: the CSV indexer only works in local mode, it does not
> > > > yet work in distributed mode (on a real Hadoop cluster).
> > > > It was initially intended for debugging, not for larger production
> > > > setups.
> > >
> > > The issue is described here:
> > > https://issues.apache.org/jira/browse/NUTCH-2793
> > >
> > > It's a tough one because a solution requires a change to the IndexWriter
> > > interface. Index writers are plugins and do not know from which reducer
> > > task they are run or to which path on a distributed or parallelized
> > > system they have to write. On Hadoop, writing the output is done in two
> > > steps: write to a local file, then "commit" the output to its final
> > > location on the distributed file system.
> > >
> > > But yes, I should have another look at this issue, which has been
> > > stalled for quite some time. Also because it's now clear that you might
> > > run into issues even in local mode.
> > >
> > > Thanks for reporting the issue! If you can, please also comment on the
> > > Jira issue!
> > >
> > > Best,
> > > Sebastian
> >
> > --
> > Paul Escobar Mossos
> > skype: paulescom
> > phone: +57 1 3006815404

--
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
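[Editor's note: the append-instead-of-delete idea discussed in the thread can be illustrated with a minimal sketch using plain java.nio. This is not Nutch code; the class and method names (AppendSketch, writeRow) are hypothetical, and a real patch to CSVIndexerWriter would have to go through Hadoop's FileSystem and output-commit machinery, as Sebastian describes.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch: instead of deleting an existing output file
// (the fs.delete(csvLocalOutFile, true) branch quoted above), open the
// file in append mode so a second write adds rows rather than
// truncating the rows written earlier.
public class AppendSketch {

    // Writes one CSV row, creating the file if absent, appending otherwise.
    static void writeRow(Path csvOut, String row) throws IOException {
        Files.writeString(csvOut, row + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("nutch", ".csv");
        Files.delete(out); // start from a clean slate
        writeRow(out, "url1,title1");
        writeRow(out, "url2,title2"); // appends; does not truncate the first row
        List<String> lines = Files.readAllLines(out);
        System.out.println(lines.size()); // both rows survive
    }
}
```

Note that blind appending has its own hazard: re-running the indexer over the same segment would duplicate rows, which is why the Jira discussion points at the reducer/commit interface rather than a one-line change.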