Thank you for your suggestion. I will take a look at that. There is a
URLUtil class in Nutch's source code, but I am wondering whether it will
send a request to the URL again to get the data. Since the URL's metadata
has already been downloaded, it would be better to get the data locally.
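
To make the idea concrete, here is a rough sketch of what I mean by checking
locally. It is only an illustration under my own assumptions: SimpleUrlFilter,
LocalDedupFilter, and the in-memory map of already-fetched content are
placeholder names I made up, not Nutch classes, and the exact-match MD5 digest
is just a stand-in for whatever duplicate detection algorithm ends up being
used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a filter that is handed only the URL string:
// return the URL to keep it, or null to reject it.
interface SimpleUrlFilter {
    String filter(String url);
}

// Rejects a URL when the content already stored for it locally matches
// content seen for an earlier URL. No network request is made.
class LocalDedupFilter implements SimpleUrlFilter {
    // Assumption: url -> bytes that were fetched earlier and kept locally.
    private final Map<String, byte[]> localContent;
    private final Set<String> seenDigests = new HashSet<>();

    LocalDedupFilter(Map<String, byte[]> localContent) {
        this.localContent = localContent;
    }

    @Override
    public String filter(String url) {
        byte[] content = localContent.get(url);
        if (content == null) {
            return url; // nothing stored locally, so nothing to compare
        }
        // Set.add returns false when the digest was already present.
        return seenDigests.add(md5Hex(content)) ? url : null;
    }

    // Hex-encoded MD5 of the content, used as an exact-duplicate fingerprint.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With this shape, the filter still receives only the URL, and the open question
is just where the local map would come from in a real crawl.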

On Sunday, February 22, 2015, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> Hey,
>
> I haven't started working on the deduplication yet, but if I were you I
> would use the Tika library to retrieve the MIME type and metadata. The
> code is presented in the Tika book. Why not try that out? :)
>
> Best,
> Jiaxin
>
> On Sunday, February 22, 2015, Renxia Wang <renxi...@usc.edu> wrote:
>
>> Hi
>>
>> I want to develop a URLFilter that takes a URL, gets its metadata or
>> even the fetched content, and then uses some duplicate detection
>> algorithm to determine whether it is a duplicate of any URL in the batch.
>> However, the only parameter passed into the URLFilter is the URL. Is it
>> possible to get the data I want for that input URL inside the URLFilter?
>>
>> Thanks,
>>
>> Zhique
>>
>
