Hi Markus,

Thanks for your response. My responses are inline.

On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
<[email protected]>wrote:

> hi
>
>
> On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]>
> wrote:
>
>> Any ideas?
>>
>> On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> A few days back there was a discussion on the way to extract data from
>>> raw
>>> html content (
>>>
>>> http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html
>>> )
>>> and how to read it as DOM. We have a custom parser which ends up working
>>> on
>>> the raw content.
>>>
>>>
>>> This is how it works for us:
>>> Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
>>>
>>> In the custom parser, we end up parsing content as DOM and populating our
>>> database.
>>>
>>>
>>> I am wondering whether Nutch can do anything in this scenario to help with
>>> de-duplication of content, or whether it would be the responsibility of the
>>> parse logic to verify that the content is not a duplicate by keeping a hash
>>> of already-existing content?
>>>
>>
> What do you want to deduplicate? CrawlDB records based on what? Segment
> records? ParseData? ParseText?
>
Primarily parse text. But your questions have got me thinking. I guess the
parse text might well differ because of dynamic content that appears on the
page at different times, right? Parse data would be mostly metadata and
outlinks, which is not as interesting.
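
(As an aside: if small dynamic differences are the worry, my understanding is
that Nutch's TextProfileSignature can be swapped in for the default
MD5Signature in nutch-site.xml, so near-identical pages produce the same
signature. I have not verified this on our setup, so treat it as a sketch:)

```xml
<!-- nutch-site.xml: use a fuzzy text signature instead of an exact MD5 -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```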

Would Nutch have helped if we were getting the same parsed text?
Nevertheless, since the data is extracted and persisted before it reaches
the segment, it should be the custom parser that is responsible.
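
To make concrete what I mean by "keeping a hash": a minimal sketch, where the
in-memory set stands in for whatever store our parser actually persists to.
The class and method names are illustrative only, not real Nutch API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative content de-duplication inside a custom parser.
 * The in-memory Set is a stand-in for the database the parser
 * would actually check against in production.
 */
public class ContentDeduper {
    private final Set<String> seenSignatures = new HashSet<>();

    /** Returns true the first time a given parse text is seen. */
    public boolean isNew(String parseText) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(parseText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Set.add returns false when the signature was already present.
        return seenSignatures.add(hex.toString());
    }

    public static void main(String[] args) throws Exception {
        ContentDeduper deduper = new ContentDeduper();
        System.out.println(deduper.isNew("same page text")); // true
        System.out.println(deduper.isNew("same page text")); // false: duplicate
    }
}
```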


>>> I see that there is a Nutch plugin for Solr dedup,
>>> http://wiki.apache.org/nutch/bin/nutch%20solrdedup
>>> but we are not using Solr.
>>>
>>> Also for the link deduplication, is my assumption correct that CrawlDB
>>> would not allow duplicate links to get inside it?
>>>
>>
> What link deduplication do you mean? CrawlDB records have a unique key on
> the URL.
>

Ok good, that helps.

>
>
>>> Regards | Vikas
>>> www.knoldus.com
>>>
>>>
>>>
> --
> Markus Jelsma - CTO - Openindex
>
