Hi Talat and Julien, I just wanted to say thanks for your help. I applied the patch and all is well with SolrDeleteDuplicates now.
Thanks,
Madhvi Arora

On 11/11/13 12:14 PM, "Arora, Madhvi" <[email protected]> wrote:

> Thanks Julien. I will try out one of these options later today.
>
> Madhvi

On 11/11/13 11:03 AM, "Julien Nioche" <[email protected]> wrote:

> I have just committed this patch to trunk, so you can use the trunk if
> you don't want to have to apply the patch yourself. Just use
>
>   svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
>
> to pull the content of the trunk.
>
> Julien

On 11 November 2013 13:50, <[email protected]> wrote:

> Thank you so much Talat. I will try this out. Hopefully this will fix my
> problem :)
>
> Madhvi

On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> If you have non-Nutch data in Solr, you are right: you need this patch :)
>
> Here is how to apply it on Linux:
> - Download the Nutch source code from
>   http://nutch.apache.org/downloads.html
> - Extract the source zip file.
> - Download the patch file next to (not inside) the source code directory.
> - Go to the source directory in a terminal and run:
>   patch -p0 < ../NUTCH-1100-1.6-1.patch
> - If everything is OK, rebuild Nutch with ant. Detailed build
>   instructions are at https://wiki.apache.org/nutch/HowToContribute
>
> That's all.
> Talat

On 11-11-2013 15:02, [email protected] wrote:

> No, I am not truncating the page. The url-validator checks that the URLs
> crawled by Nutch are valid, so this plugin will probably not fix my
> problem, though I have added it to my list of plugins because you said
> it's good to have.
>
> I think the problem here is that my Solr index contains data both from
> Nutch and from outside Nutch. The documents indexed outside Nutch do not
> have a digest field, so while executing SolrDeleteDuplicates, Nutch
> throws an exception on not finding the digest field. Nutch needs to skip
> a document if it has no digest field.
>
> I think this patch will probably fix my problem. Am I correct? If yes,
> how do I apply the patch, or is there a jar with the fix available that
> I can download?
> I could also add the digest field to the non-Nutch data. What should the
> value of that field be?
>
> Thanks,
> Madhvi

On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> After adding urlfilter-validator in nutch-site.xml, do you truncate your
> webpage?
>
> Talat

On 09-11-2013 00:35, [email protected] wrote:

> Hi Talat,
>
> I can re-create this exception. It starts happening as soon as I index
> from outside Nutch. SolrDeleteDuplicates works fine as long as the whole
> Solr index came from Nutch.
> I haven't found out yet which field might be causing it, but looking at
> the issue below, it might be the digest field not being there:
> https://issues.apache.org/jira/browse/NUTCH-1100
>
> Could it be some other field?
>
> Also, there is a patch for the digest field. How should I apply it? Any
> help would be great!
>
> Madhvi

On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:

> You wrote it wrong.
> You should write it like this:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Put this in nutch-site.xml, then rebuild with: ant clean runtime
>
> Talat

[email protected] wrote:

> Hi Talat,
> No, I am not using the urlfilter-validator plugin. Here is my list of
> plugins:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Do I just need to change this to:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Thank you so much,
>
> Madhvi

On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> Can you tell me which plugins are active in your nutch-site.xml? I am
> not sure, but we have an issue similar to this. If your Solr returns
> null, that would be because of this issue. Please check the data your
> Solr returns.
>
> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>
> If yours is the same, you should use the urlfilter-validator plugin.
>
> urlfilter-validator has lots of benefits; I described them in
> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
>
> Talat

[email protected] wrote:

> I am going to start my own thread rather than being under javozzo's
> thread :)!
>
> Hi,
>
> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
> SolrDeleteDuplicates command. Looking at the Hadoop logs, I am getting
> this error:
>
>   java.lang.NullPointerException
>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>     at org.apache.hadoop.io.Text.set(Text.java:178)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> I also had another question, about updating Nutch to 1.6 or 1.7. I had
> tried updating to a newer version of Nutch but got an exception while
> deleting duplicates in Solr. After a lot of research online I found that
> a field had changed: some said the digest field, others said the url
> field is no longer there. So here are my questions:
> 1. Is there a newer Solr mapping file that needs to be used?
> 2. Can the Solr index from 1.5.1 and an index from a newer version
>    co-exist, or do we need to re-index from one version of Nutch?
>
> I will really appreciate any help with this.
>
> Thanks in advance,
> Madhvi
>
> Madhvi Arora
> AutomationDirect
> The #1 Best Mid-Sized Company to work for in Atlanta
> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012

> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
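[A note on the `patch -p0` step from Talat's instructions above: `-p0` strips no leading path components, so the paths inside the patch file are used as-is relative to the current directory. That is why the patch file sits next to the source directory and the command runs from inside it. A minimal, self-contained sketch of the mechanics, using made-up file names rather than the real Nutch tree:]

```shell
# Toy demonstration of the patch -p0 workflow (all names here are
# invented; the real command, run from inside the Nutch source dir, is:
#   patch -p0 < ../NUTCH-1100-1.6-1.patch )
demo=$(mktemp -d)
cd "$demo"
mkdir nutch-src                      # stands in for the Nutch source dir
printf 'hello\n' > nutch-src/a.txt   # file the "patch" will modify
# A tiny hand-written unified diff, placed NEXT TO (not inside) the
# source directory, just like the real patch file:
cat > fix.patch <<'EOF'
--- a.txt
+++ a.txt
@@ -1 +1 @@
-hello
+patched
EOF
cd nutch-src
patch -p0 < ../fix.patch             # -p0: use the patch's paths as-is
cat a.txt                            # file content has been replaced
```

If the paths in a patch file carry a leading directory (e.g. `a/src/...` and `b/src/...` from `git diff`), `-p1` would be needed instead to strip that first component.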

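[A side note on the plugin.includes values discussed in the thread: the property is a regular expression matched against plugin ids, so Talat's `urlfilter-(regex|validator)` activates both URL filter plugins, while Madhvi's proposed `parse|validator-(...)` variant groups the alternatives differently and would not. One quick way to sanity-check which ids a value matches, using `grep -E` purely as a stand-in (Nutch itself evaluates the value as a Java regex):]

```shell
# Talat's corrected plugin.includes value:
pattern='protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)'
# Return success if a plugin id matches the pattern as a whole line:
matches() { printf '%s\n' "$1" | grep -Eqx "$pattern"; }
matches urlfilter-regex     && echo "urlfilter-regex: activated"
matches urlfilter-validator && echo "urlfilter-validator: activated"
matches urlfilter-domain    || echo "urlfilter-domain: NOT activated"
```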

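[On Madhvi's question about what value to put in the digest field for documents indexed outside Nutch: with Nutch's default signature configuration the digest is, as far as I can tell, a 32-character lowercase hex MD5 hash of the fetched content. A compatible-looking value can be generated like this; the sample content is invented, and whether plain MD5 of the raw bytes matches your particular Nutch signature setup should be verified against documents already in your index:]

```shell
# Compute an MD5 hex digest over a document's content (sample content is
# made up for illustration):
content='<html><body>example page</body></html>'
digest=$(printf '%s' "$content" | md5sum | awk '{print $1}')
echo "$digest"    # 32 lowercase hex characters
```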