Hi Madhvi,
If you have data from outside Nutch in Solr, you are right: you need this
patch :)
Here is how to do it on Linux:
- Download the Nutch source code from http://nutch.apache.org/downloads.html
- Extract the source archive.
- Download the patch file next to (not inside) the source code directory.
- Go to the source directory in a terminal.
- Run this command:
patch -p0 < ../NUTCH-1100-1.6-1.patch
- If everything is OK, you can rebuild Nutch with ant; the consolidated
commands below show the whole flow. Detailed build instructions are at
https://wiki.apache.org/nutch/HowToContribute
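Putting it all together, it looks roughly like this (the download URL and
the file/directory names below are just examples for a 1.6 source release;
use the version you actually downloaded):

    wget http://archive.apache.org/dist/nutch/1.6/apache-nutch-1.6-src.tar.gz
    tar xzf apache-nutch-1.6-src.tar.gz
    cd apache-nutch-1.6
    # the patch file sits next to, not inside, the source directory
    patch -p0 < ../NUTCH-1100-1.6-1.patch
    ant clean runtime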
That's all
Talat
On 11-11-2013 15:02, [email protected] wrote:
No, I am not truncating the page. The url-validator plugin checks that the
URLs crawled by Nutch are valid, so it will probably not fix my problem,
though I have added it to my list of plugins because you said it's good to
have.
I think the problem here is that my SOLR index contains data from both
Nutch and outside Nutch. The data indexed outside Nutch does not have a
digest field, so while executing SolrDeleteDuplicates, Nutch throws an
exception when it cannot find the digest field. Nutch needs to skip the
document if there is no digest field.
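For example, a query like this (assuming a default local Solr URL; adjust
it for your setup) should count the documents that have no digest field:

    curl -G "http://localhost:8983/solr/select" \
      --data-urlencode "q=*:* -digest:[* TO *]" \
      --data-urlencode "rows=0"

If numFound in the response is greater than zero, those are the documents
SolrDeleteDuplicates trips over.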
I think that this patch will probably fix my problem. Am I correct? If
yes, how do I apply the patch, or is there a jar with the fix available
that I can download?
I can also add the digest field to the outside-Nutch data. What should the
value in that field be?
Thanks,
Madhvi
On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
Hi Madhvi,
After adding urlfilter-validator in nutch-site.xml, do you truncate your
webpage?
Talat
On 09-11-2013 00:35, [email protected] wrote:
Hi Talat,
I can re-create this exception. It starts happening as soon as I index
from outside Nutch. SolrDeleteDuplicates works fine as long as the whole
Solr index came from Nutch.
I haven't found out yet specifically which field might be causing it. But
looking at the issue below, it might be because of the digest field not
being there.
https://issues.apache.org/jira/browse/NUTCH-1100
Can it be some other field?
Also, there is a patch for the digest field. How should I apply it? Any
help will be great!
Madhvi
On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
You wrote it incorrectly. You should write it like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Write this in nutch-site.xml, and after that you should rebuild with ant
clean runtime.
Talat
[email protected] wrote:
Hi Talat,
No, I am not using the urlfilter-validator plugin. Here is my list of
plugins:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Do I just need to change this to:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Thank you so much,
Madhvi
On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
Hi Madhvi,
Can you tell me what the active plugins in your nutch-site.xml are? I am
not sure, but we have an issue similar to this: if your Solr returns null,
it would be because of this issue. Please check the data your Solr
returns. You can look at https://issues.apache.org/jira/browse/NUTCH-1100
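For example, something like this (the host, port, and path are only
guesses; adjust them to your setup) shows which fields each document
actually has:

    curl -G "http://localhost:8983/solr/select" \
      --data-urlencode "q=*:*" \
      --data-urlencode "fl=id,digest" \
      --data-urlencode "rows=5"

If some documents come back without a digest value, you are hitting this
issue.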
If yours is the same, you should use the urlfilter-validator plugin.
urlfilter-validator has lots of benefits; I wrote about them in
http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
Talat
[email protected] wrote:
I am going to start my own thread rather than being under javozzo's
thread :)!
Hi,
I am using Nutch 1.5.1 and Solr 3.6 and am having a problem with the
SolrDeleteDuplicates command. Looking at the Hadoop logs, I am getting
this error:
java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.set(Text.java:178)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Also, I had another question about updating Nutch to 1.6 or 1.7. I had
tried updating to a newer version of Nutch but got an exception while
deleting duplicates in SOLR. After a lot of research online, I found that
a field had changed: a few said the digest field, and others said that
the url field is no longer there. So here are my questions:
1: Is there a newer solr mapping file that needs to be used?
2: Can the SOLR index from 1.5.1 and an index from a newer version
co-exist, or do we need to re-index from one version of Nutch?
I will really appreciate any help with this.
Thanks in advance,
Madhvi
Madhvi Arora
AutomationDirect
The #1 Best Mid-Sized Company to work for in Atlanta<http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012