Thanks. I will take a look at that.

On Sunday, February 22, 2015, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> You are absolutely right! I am just throwing out ideas :) If you want to
> work with local data, org.apache.nutch.segment.SegmentReader may be helpful,
> I guess, since all of the parsed content is stored in the segments.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <renxi...@usc.edu> wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in Nutch's source code, but I wonder whether it sends a
>> new request to the URL to get the data. Because the URL's metadata
>> has already been downloaded, it would be better to get the data locally.
>>
>>
>> On Sunday, February 22, 2015, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The code
>>> is presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang <renxi...@usc.edu> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to develop a URLFilter that takes a URL, looks up its metadata
>>>> or even the fetched content, and then uses some duplicate detection
>>>> algorithm to determine whether it is a duplicate of any URL in Nutch.
>>>> However, the only parameter passed into the URLFilter is the URL. Is it
>>>> possible to get the data I want for that input URL inside the URLFilter?
>>>>
>>>> Thanks,
>>>>
>>>> Zhique
>>>>
>>>
>
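
For reference, the duplicate check discussed in this thread could be sketched
roughly as below. This is only a minimal illustration under some assumptions:
the fetched content is already available locally as a byte array, and only
byte-identical content counts as a duplicate. ContentDeduplicator and its
methods are hypothetical names, not part of Nutch or Tika, and the sketch does
not show how to obtain the content from inside a URLFilter.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): flags byte-identical content. */
class ContentDeduplicator {
    // Hex digests of every piece of content seen so far.
    private final Set<String> seenDigests = new HashSet<>();

    /** Returns true if identical content was seen before; otherwise records it. */
    public boolean isDuplicate(byte[] content) {
        // Set.add returns false when the digest is already present.
        return !seenDigests.add(digestHex(content));
    }

    /** Computes the SHA-256 digest of the content as a lowercase hex string. */
    static String digestHex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling isDuplicate(bytes) on each fetched page returns false the first time
a given body is seen and true for every later byte-identical copy.
Near-duplicates (pages that differ only slightly) would need a fuzzier
signature than a plain content hash.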
