[ 
https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609308#action_12609308
 ] 

Michael Gottesman commented on NUTCH-634:
-----------------------------------------

Hadoop actually has something for this called the HiddenFileFilter, in 
FileInputFormat (or a filter class, I don't remember which). I recently 
emailed the Hadoop dev list and asked whether it could be made public instead 
of private (it resolves the issue by filtering out all files that begin with 
_, e.g. _logs). The list said to submit a patch and it would be integrated 
into Hadoop 0.19.
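
For anyone following along, here is a rough sketch of the filtering idea in 
plain Java, so it runs without Hadoop on the classpath. It is not the Hadoop 
source: the class name HiddenFileFilterSketch and the methods accept() and 
visible() are made up for illustration, and skipping names that start with 
"." as well as "_" is an assumption on my part rather than something stated 
above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a hidden-file filter; not Hadoop's actual class.
public class HiddenFileFilterSketch {

    // Accept a file name unless it looks hidden.
    // "_" prefix: the case described above (e.g. _logs).
    // "." prefix: an assumption added for illustration.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Return only the names that pass the filter, preserving order.
    static List<String> visible(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (accept(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A made-up job output directory listing.
        List<String> listing =
            Arrays.asList("_logs", "part-00000", "part-00001", ".crc");
        System.out.println(visible(listing)); // prints [part-00000, part-00001]
    }
}
```

With a filter like this applied when listing input paths, a job that reads 
another job's output directory would skip _logs instead of trying to parse it 
as data.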

I am going to submit the hadoop patch in a few minutes. In the meantime your 
idea seems absolutely lovely.

So yes, your suggestion is perfect =).


> Patch - Nutch - Hadoop 0.17.0
> -----------------------------
>
>                 Key: NUTCH-634
>                 URL: https://issues.apache.org/jira/browse/NUTCH-634
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Michael Gottesman
>            Assignee: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: diff, hadoop-0.17.patch, hadoop-0.17.patch
>
>
> This is a patch so that Nutch can be used with Hadoop 0.17.0. The patch is 
> located at http://pastie.org/212001
> The patch compiles and passes all current Nutch unit tests.
> I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, 
> parse, merge w/crawldb) definitely works, but have not tested the Lucene 
> indexing part. It might work, but it might not. 
> *NOTE* - the two main bugs that had to be overcome were not noticed by any of 
> the unit tests. The bugs only came up during actual testing. The bugs were:
> 1. Changes to the Hadoop Iterator
> 2. Addition of Serialization to MapReduce Framework

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
