Thanks. I will take a look at that.

On Sunday, February 22, 2015, Jiaxin Ye <jiaxi...@usc.edu> wrote:
> You are absolutely right! I am just throwing out ideas :) If you are looking
> at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
> guess, since all the parsed content is located there.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <renxi...@usc.edu> wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in Nutch's source code, but I am wondering whether it will
>> send a request to the URL again to get the data. Since the URL's metadata
>> has already been downloaded, it would be better if we could get the data
>> locally.
>>
>> On Sunday, February 22, 2015, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The
>>> code is presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang <renxi...@usc.edu> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to develop a URLFilter which takes a URL, takes its metadata
>>>> or even the fetched content, and then uses some duplicate detection
>>>> algorithm to determine if it is a duplicate of any URL in the batch.
>>>> However, the only parameter passed into the URLFilter is the URL. Is it
>>>> possible to get the data I want for that input URL in the URLFilter?
>>>>
>>>> Thanks,
>>>>
>>>> Zhique
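
For reference, the duplicate check asked about in the original question could be sketched roughly as below. This is only a minimal illustration, not Nutch code: `ContentDeduplicator`, `findDuplicate`, and `sha256Hex` are hypothetical names, and it assumes the fetched bytes for each URL have already been read from local data (for example from a segment), since a Nutch URLFilter itself only receives the URL string. It detects exact duplicates only, by hashing the content with SHA-256; near-duplicate detection would need a different algorithm.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: remembers a digest of each page's
// content and reports which earlier URL, if any, had byte-identical content.
class ContentDeduplicator {
    // Hex digest of the content -> first URL seen with that content.
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    // Returns the URL first seen with identical content, or null if this
    // content is new (or was first registered under this same URL).
    String findDuplicate(String url, byte[] content) {
        String digest = sha256Hex(content);
        String previous = firstUrlByDigest.putIfAbsent(digest, url);
        return (previous == null || previous.equals(url)) ? null : previous;
    }

    // Computes the SHA-256 digest of the data as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A filter-like component could call `findDuplicate(url, content)` for each URL in the batch and drop the URL whenever the result is not null. How the content bytes are obtained inside Nutch is left open here, as in the thread.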