I have a question regarding the DeleteDuplicates class in
Nutch. Currently I can make the call:
deleteDuplicates.run(new String[]{a + "/indexes", b + "/indexes" } );
which works, since each of these directories contains "index"
subdirectories.
However, suppose I have two "complete" indexes:
a + "/index" and b + "/index" that I want to delete duplicates from
before merging them. Neither of these has subdirectories, so a call that
looks like:
deleteDuplicates.run(new String[]{a + "/index", b + "/index" } );
will fail.
I can get around this by copying both directories into a new
parent directory and then running DeleteDuplicates.run() on that parent,
but the copy is unnecessary extra work.
Is there a simple modification I can make, perhaps to the
DeleteDuplicates.dedup method, so that I don't have to copy
these index directories under a single directory just to run
the DeleteDuplicates operation?
If not, is my best bet to create a directory and then symlink to
these two directories from inside it (making them virtual
subdirectories), to avoid the copy?
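To make the symlink idea concrete, here is a rough sketch of what I have in mind. The class IndexLinker, the helper linkUnderParent, and the link names "index-0", "index-1" are all made up for illustration, and I am assuming (without having verified it) that DeleteDuplicates will follow symlinks and accept these names when it lists the subdirectories:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class IndexLinker {
    // Hypothetical helper: creates `parent` if needed and puts one symlink
    // per index directory inside it, named index-0, index-1, ...
    // Assumption: the dedup code follows symlinks when listing `parent`.
    static Path linkUnderParent(Path parent, Path... indexDirs) throws IOException {
        Files.createDirectories(parent);
        int i = 0;
        for (Path dir : indexDirs) {
            Path link = parent.resolve("index-" + i++);
            Files.deleteIfExists(link); // removes a stale link, not its target
            Files.createSymbolicLink(link, dir.toAbsolutePath());
        }
        return parent;
    }
}
```

The call would then be deleteDuplicates.run(new String[]{ parent.toString() }) with parent being the directory returned above, if that is workable at all.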
Thanks a ton,
Vijay