Thank you so much, Talat. I will try this out. Hopefully this will fix my problem :)
Madhvi

On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:

>Hi Madhvi,
>
>If you have data in Solr from outside Nutch, you are right: you need this
>patch :)
>
>I can explain how to do it on Linux:
>- Download the Nutch source code from
>http://nutch.apache.org/downloads.html
>- Extract the source zip file.
>- Download the patch file next to (not inside) the source code directory.
>- Go to the source directory in a terminal.
>- Run this command:
>patch -p0 < ../NUTCH-1100-1.6-1.patch
>- If everything is OK, you can rebuild Nutch with ant. You can look at
>detailed build instructions at
>https://wiki.apache.org/nutch/HowToContribute
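>
>Putting those steps together, the whole sequence looks roughly like this.
>It is only a sketch: the archive and directory names below are my
>assumption, so adjust them to match what the download actually gives you.
>
>  # assumes apache-nutch-1.6-src.zip (from the downloads page above) and
>  # NUTCH-1100-1.6-1.patch sit side by side in the current directory
>  unzip apache-nutch-1.6-src.zip
>  cd apache-nutch-1.6
>  # apply the patch from inside the source directory
>  patch -p0 < ../NUTCH-1100-1.6-1.patch
>  # rebuild; the patched runtime ends up under runtime/local
>  ant clean runtime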
>
>That's all,
>Talat
>
>On 11-11-2013 15:02, [email protected] wrote:
>> No, I am not truncating the page. The url-validator checks that the urls
>> crawled by Nutch are valid. So this plugin will probably not fix my
>> problem, though I have added it to my list of plugins because you said
>> that it's good to have.
>>
>> I think the problem here is that my SOLR index contains data from Nutch
>> and from outside Nutch. The data indexed outside Nutch does not have a
>> digest field, so while executing SolrDeleteDuplicates, Nutch throws an
>> exception when it does not find the digest field. Nutch needs to skip
>> the document if there is no digest field.
>>
>> I think that this patch will probably fix my problem. Am I correct? If
>> yes, then how do I apply the patch, or is there a jar with the fix
>> available that I can download?
>> I can also add the digest field to the outside-Nutch data. What should
>> the value in that field be?
>>
>> Thanks,
>> Madhvi
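>>
>> P.S. Two things I plan to try in the meantime, sketched below. The Solr
>> URL and core are from my own setup, and the MD5 idea is only my guess at
>> what Nutch puts in the digest field, so please correct me if the
>> signature is computed differently:
>>
>>   # count documents that have no digest field at all; these should be
>>   # the ones breaking SolrDeleteDuplicates
>>   curl 'http://localhost:8983/solr/select?q=-digest:%5B*%20TO%20*%5D&rows=0'
>>
>>   # if the digest is just an MD5 hex of the raw page content (my guess),
>>   # I could backfill it for a non-Nutch document like this:
>>   md5sum page.html | awk '{print $1}'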
>>
>> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>>
>>> Hi Madhvi,
>>>
>>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>>> your webpage?
>>>
>>> Talat
>>>
>>> On 09-11-2013 00:35, [email protected] wrote:
>>>> Hi Talat,
>>>>
>>>> I can re-create this exception. It starts happening as soon as I index
>>>> from outside Nutch. SolrDeleteDuplicates works fine as long as the
>>>> whole Solr index came from Nutch.
>>>> I haven't found out yet specifically which field might be causing it.
>>>> But looking at the issue below, it might be because the digest field
>>>> is not there.
>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>
>>>> Can it be some other field?
>>>>
>>>> Also, there is a patch for the digest field. How should I apply it?
>>>> Any help will be great!
>>>>
>>>> Madhvi
>>>>
>>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>>>
>>>>> You wrote it wrong. You should write it like this:
>>>>>
>>>>> <property>
>>>>>   <name>plugin.includes</name>
>>>>>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>> </property>
>>>>>
>>>>> Write this in nutch-site.xml, and after that rebuild with ant clean
>>>>> runtime.
>>>>>
>>>>> Talat
>>>>>
>>>>> [email protected] wrote:
>>>>>
>>>>>> Hi Talat,
>>>>>> No, I am not using the urlfilter-validator plugin. Here is my list
>>>>>> of plugins:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Do I just need to change this to:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Thank you so much,
>>>>>>
>>>>>> Madhvi
>>>>>>
>>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Madhvi,
>>>>>>>
>>>>>>> Can you tell me what the active plugins in your nutch-site.xml are?
>>>>>>> I am not sure, but we have an issue similar to this: if your Solr
>>>>>>> returns null, it will be because of this issue. Please check the
>>>>>>> data your Solr returns.
>>>>>>>
>>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>>
>>>>>>> If yours is the same, you should use the urlfilter-validator plugin.
>>>>>>>
>>>>>>> urlfilter-validator has lots of benefits; I described them in
>>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
>>>>>>>
>>>>>>> Talat
>>>>>>>
>>>>>>> [email protected] wrote:
>>>>>>>
>>>>>>>> I am going to start my own thread rather than being under javozzo's
>>>>>>>> thread :)!
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
>>>>>>>> SolrDeleteDuplicates command. Looking at the Hadoop logs, I am
>>>>>>>> getting this error:
>>>>>>>>
>>>>>>>> java.lang.NullPointerException
>>>>>>>>   at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>>   at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>
>>>>>>>> I also had another question about updating Nutch to 1.6 and 1.7. I
>>>>>>>> had tried updating to a newer version of Nutch but got an exception
>>>>>>>> while deleting duplicates in SOLR. After a lot of research online I
>>>>>>>> found that a field had changed. A few said the digest field, and
>>>>>>>> others said that the url field is no longer there. So here are my
>>>>>>>> questions:
>>>>>>>> 1: Is there a newer solr mapping file that needs to be used?
>>>>>>>> 2: Can the SOLR index from 1.5.1 and an index from a newer version
>>>>>>>> co-exist, or do we need to re-index from one version of Nutch?
>>>>>>>>
>>>>>>>> I will really appreciate any help with this.
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Madhvi
>>>>>>>>
>>>>>>>> Madhvi Arora
>>>>>>>> AutomationDirect
>>>>>>>> The #1 Best Mid-Sized Company to work for in Atlanta
>>>>>>>> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012

