Thank you so much, Talat. I will try this out. Hopefully this will fix my problem :)
Madhvi

On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:

>Hi Madhvi,
>
>If you have data in Solr from outside Nutch, you are right: you need this
>patch :)
>
>I can explain how to do it on Linux:
>- Download the Nutch source code from
>http://nutch.apache.org/downloads.html
>- Extract the source zip file.
>- Download the patch file next to (not inside) the source code directory.
>- Go to the source directory in a terminal.
>- Run this command:
>patch -p0 < ../NUTCH-1100-1.6-1.patch
>- If everything is OK, you can rebuild Nutch with ant. You can look at
>detailed build instructions at
>https://wiki.apache.org/nutch/HowToContribute
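>
>Putting those steps together, the whole sequence looks roughly like this.
>It is only a sketch: the archive and directory names below are my
>assumption, so adjust them to match what the download actually gives you.
>
>  # assumes apache-nutch-1.6-src.zip (from the downloads page above) and
>  # NUTCH-1100-1.6-1.patch sit side by side in the current directory
>  unzip apache-nutch-1.6-src.zip
>  cd apache-nutch-1.6
>  # apply the patch from inside the source directory
>  patch -p0 < ../NUTCH-1100-1.6-1.patch
>  # rebuild; the patched runtime ends up under runtime/local
>  ant clean runtime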
>
>That's all,
>Talat
>
>On 11-11-2013 15:02, [email protected] wrote:
>> No, I am not truncating the page. The url-validator checks that the urls
>> crawled by Nutch are valid. So this plugin will probably not fix my
>> problem, though I have added it to my list of plugins because you said
>> that it's good to have.
>>
>> I think the problem here is that my SOLR index contains data from Nutch
>> and from outside Nutch. The data indexed outside Nutch does not have a
>> digest field, so while executing SolrDeleteDuplicates, Nutch throws an
>> exception when it does not find the digest field. Nutch needs to skip
>> the document if there is no digest field.
>>
>> I think that this patch will probably fix my problem. Am I correct? If
>> yes, then how do I apply the patch, or is there a jar with the fix
>> available that I can download?
>> I can also add the digest field to the outside-Nutch data. What should
>> the value in that field be?
>>
>> Thanks,
>> Madhvi
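>>
>> P.S. Two things I plan to try in the meantime, sketched below. The Solr
>> URL and core are from my own setup, and the MD5 idea is only my guess at
>> what Nutch puts in the digest field, so please correct me if the
>> signature is computed differently:
>>
>>   # count documents that have no digest field at all; these should be
>>   # the ones breaking SolrDeleteDuplicates
>>   curl 'http://localhost:8983/solr/select?q=-digest:%5B*%20TO%20*%5D&rows=0'
>>
>>   # if the digest is just an MD5 hex of the raw page content (my guess),
>>   # I could backfill it for a non-Nutch document like this:
>>   md5sum page.html | awk '{print $1}'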
>>
>> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>>
>>> Hi Madhvi,
>>>
>>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>>> your webpage?
>>>
>>> Talat
>>>
>>> On 09-11-2013 00:35, [email protected] wrote:
>>>> Hi Talat,
>>>>
>>>> I can re-create this exception. It starts happening as soon as I index
>>>> from outside Nutch. SolrDeleteDuplicates works fine as long as the
>>>> whole Solr index came from Nutch.
>>>> I haven't found out yet specifically which field might be causing it.
>>>> But looking at the issue below, it might be because the digest field
>>>> is not there.
>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>
>>>> Can it be some other field?
>>>>
>>>> Also, there is a patch for the digest field. How should I apply it?
>>>> Any help will be great!
>>>>
>>>> Madhvi
>>>>
>>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>>>
>>>>> You wrote it wrong. You should write it like this:
>>>>>
>>>>> <property>
>>>>>   <name>plugin.includes</name>
>>>>>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>> </property>
>>>>>
>>>>> Write this in nutch-site.xml, and after that rebuild with ant clean
>>>>> runtime.
>>>>>
>>>>> Talat
>>>>>
>>>>> [email protected] wrote:
>>>>>
>>>>>> Hi Talat,
>>>>>> No, I am not using the urlfilter-validator plugin. Here is my list
>>>>>> of plugins:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Do I just need to change this to:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Thank you so much,
>>>>>>
>>>>>> Madhvi
>>>>>>
>>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Madhvi,
>>>>>>>
>>>>>>> Can you tell me what the active plugins in your nutch-site.xml are?
>>>>>>> I am not sure, but we have an issue similar to this: if your Solr
>>>>>>> returns null, it will be because of this issue. Please check the
>>>>>>> data your Solr returns.
>>>>>>>
>>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>>
>>>>>>> If yours is the same, you should use the urlfilter-validator plugin.
>>>>>>>
>>>>>>> urlfilter-validator has lots of benefits; I described them in
>>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
>>>>>>>
>>>>>>> Talat
>>>>>>>
>>>>>>> [email protected] wrote:
>>>>>>>
>>>>>>>> I am going to start my own thread rather than being under javozzo's
>>>>>>>> thread :)!
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
>>>>>>>> SolrDeleteDuplicates command. Looking at the Hadoop logs, I am
>>>>>>>> getting this error:
>>>>>>>>
>>>>>>>> java.lang.NullPointerException
>>>>>>>>   at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>>   at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>
>>>>>>>> I also had another question about updating Nutch to 1.6 and 1.7. I
>>>>>>>> had tried updating to a newer version of Nutch but got an exception
>>>>>>>> while deleting duplicates in SOLR. After a lot of research online I
>>>>>>>> found that a field had changed. A few said the digest field, and
>>>>>>>> others said that the url field is no longer there. So here are my
>>>>>>>> questions:
>>>>>>>> 1: Is there a newer solr mapping file that needs to be used?
>>>>>>>> 2: Can the SOLR index from 1.5.1 and an index from a newer version
>>>>>>>> co-exist, or do we need to re-index from one version of Nutch?
>>>>>>>>
>>>>>>>> I will really appreciate any help with this.
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Madhvi
>>>>>>>>
>>>>>>>> Madhvi Arora
>>>>>>>> AutomationDirect
>>>>>>>> The #1 Best Mid-Sized Company to work for in Atlanta
>>>>>>>> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012

