Hi Talat and Julien, I just wanted to say thanks for your help. I applied the patch and all is well with SolrDeleteDuplicates now.
Thanks,
Madhvi Arora

On 11/11/13 12:14 PM, "Arora, Madhvi" <[email protected]> wrote:

> Thanks Julien. I will try out one of these options later today.
>
> Madhvi

On 11/11/13 11:03 AM, "Julien Nioche" <[email protected]> wrote:

> I have just committed this patch to trunk, so you can use the trunk if
> you don't want to have to apply the patch yourself. Just use
>
>   svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
>
> to pull the content of the trunk.
>
> Julien

On 11 November 2013 13:50, <[email protected]> wrote:

> Thank you so much Talat. I will try this out. Hopefully this will fix my
> problem :)
>
> Madhvi

On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> If you have non-Nutch data in Solr, you are right: you need this patch :)
>
> Here is how to apply it on Linux:
> - Download the Nutch source code from
>   http://nutch.apache.org/downloads.html
> - Extract the source zip file.
> - Download the patch file next to (not inside) the source code directory.
> - Go to the source directory in a terminal and run:
>   patch -p0 < ../NUTCH-1100-1.6-1.patch
> - If everything is OK, rebuild Nutch with ant. Detailed build
>   instructions are at https://wiki.apache.org/nutch/HowToContribute
>
> That's all.
> Talat

On 11-11-2013 15:02, [email protected] wrote:

> No, I am not truncating the page. The url-validator checks that the URLs
> crawled by Nutch are valid, so this plugin will probably not fix my
> problem, though I have added it to my list of plugins because you said
> it's good to have.
>
> I think the problem here is that my Solr index contains data both from
> Nutch and from outside Nutch. The documents indexed outside Nutch do not
> have a digest field, so while executing SolrDeleteDuplicates, Nutch
> throws an exception on not finding the digest field. Nutch needs to skip
> a document if it has no digest field.
>
> I think this patch will probably fix my problem. Am I correct? If yes,
> how do I apply the patch, or is there a jar with the fix available that
> I can download?
> I could also add the digest field to the non-Nutch data. What should the
> value of that field be?
>
> Thanks,
> Madhvi

On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> After adding urlfilter-validator in nutch-site.xml, do you truncate your
> webpage?
>
> Talat

On 09-11-2013 00:35, [email protected] wrote:

> Hi Talat,
>
> I can re-create this exception. It starts happening as soon as I index
> from outside Nutch. SolrDeleteDuplicates works fine as long as the whole
> Solr index came from Nutch.
> I haven't found out yet which field might be causing it, but looking at
> the issue below, it might be the digest field not being there:
> https://issues.apache.org/jira/browse/NUTCH-1100
>
> Could it be some other field?
>
> Also, there is a patch for the digest field. How should I apply it? Any
> help would be great!
>
> Madhvi

On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:

> You wrote it wrong.
> You should write it like this:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Put this in nutch-site.xml, then rebuild with: ant clean runtime
>
> Talat

[email protected] wrote:

> Hi Talat,
> No, I am not using the urlfilter-validator plugin. Here is my list of
> plugins:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Do I just need to change this to:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Thank you so much,
>
> Madhvi

On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:

> Hi Madhvi,
>
> Can you tell me which plugins are active in your nutch-site.xml? I am
> not sure, but we have an issue similar to this. If your Solr returns
> null, that would be because of this issue. Please check the data your
> Solr returns.
>
> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>
> If yours is the same, you should use the urlfilter-validator plugin.
>
> urlfilter-validator has lots of benefits; I described them in
> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
>
> Talat

[email protected] wrote:

> I am going to start my own thread rather than being under javozzo's
> thread :)!
>
> Hi,
>
> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
> SolrDeleteDuplicates command. Looking at the Hadoop logs, I am getting
> this error:
>
>   java.lang.NullPointerException
>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>     at org.apache.hadoop.io.Text.set(Text.java:178)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> I also had another question, about updating Nutch to 1.6 or 1.7. I had
> tried updating to a newer version of Nutch but got an exception while
> deleting duplicates in Solr. After a lot of research online I found that
> a field had changed: some said the digest field, others said the url
> field is no longer there. So here are my questions:
> 1. Is there a newer Solr mapping file that needs to be used?
> 2. Can the Solr index from 1.5.1 and an index from a newer version
>    co-exist, or do we need to re-index from one version of Nutch?
>
> I will really appreciate any help with this.
>
> Thanks in advance,
> Madhvi
>
> Madhvi Arora
> AutomationDirect
> The #1 Best Mid-Sized Company to work for in Atlanta
> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012

> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
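[A note on the `patch -p0` step from Talat's instructions above: `-p0` strips no leading path components, so the paths inside the patch file are used as-is relative to the current directory. That is why the patch file sits next to the source directory and the command runs from inside it. A minimal, self-contained sketch of the mechanics, using made-up file names rather than the real Nutch tree:]

```shell
# Toy demonstration of the patch -p0 workflow (all names here are
# invented; the real command, run from inside the Nutch source dir, is:
#   patch -p0 < ../NUTCH-1100-1.6-1.patch )
demo=$(mktemp -d)
cd "$demo"
mkdir nutch-src                      # stands in for the Nutch source dir
printf 'hello\n' > nutch-src/a.txt   # file the "patch" will modify
# A tiny hand-written unified diff, placed NEXT TO (not inside) the
# source directory, just like the real patch file:
cat > fix.patch <<'EOF'
--- a.txt
+++ a.txt
@@ -1 +1 @@
-hello
+patched
EOF
cd nutch-src
patch -p0 < ../fix.patch             # -p0: use the patch's paths as-is
cat a.txt                            # file content has been replaced
```

If the paths in a patch file carry a leading directory (e.g. `a/src/...` and `b/src/...` from `git diff`), `-p1` would be needed instead to strip that first component.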

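[A side note on the plugin.includes values discussed in the thread: the property is a regular expression matched against plugin ids, so Talat's `urlfilter-(regex|validator)` activates both URL filter plugins, while Madhvi's proposed `parse|validator-(...)` variant groups the alternatives differently and would not. One quick way to sanity-check which ids a value matches, using `grep -E` purely as a stand-in (Nutch itself evaluates the value as a Java regex):]

```shell
# Talat's corrected plugin.includes value:
pattern='protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)'
# Return success if a plugin id matches the pattern as a whole line:
matches() { printf '%s\n' "$1" | grep -Eqx "$pattern"; }
matches urlfilter-regex     && echo "urlfilter-regex: activated"
matches urlfilter-validator && echo "urlfilter-validator: activated"
matches urlfilter-domain    || echo "urlfilter-domain: NOT activated"
```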

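[On Madhvi's question about what value to put in the digest field for documents indexed outside Nutch: with Nutch's default signature configuration the digest is, as far as I can tell, a 32-character lowercase hex MD5 hash of the fetched content. A compatible-looking value can be generated like this; the sample content is invented, and whether plain MD5 of the raw bytes matches your particular Nutch signature setup should be verified against documents already in your index:]

```shell
# Compute an MD5 hex digest over a document's content (sample content is
# made up for illustration):
content='<html><body>example page</body></html>'
digest=$(printf '%s' "$content" | md5sum | awk '{print $1}')
echo "$digest"    # 32 lowercase hex characters
```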