Hi Madhvi,
If you have data from outside Nutch in Solr, you are right: you need this
patch :)
Here is how to do it on Linux:
- Download the Nutch source code from http://nutch.apache.org/downloads.html
- Extract the source archive.
- Download the patch file next to (not inside) the source code directory.
- Go to the source directory in a terminal.
- Run this command:
patch -p0 < ../NUTCH-1100-1.6-1.patch
- If everything is OK, you can rebuild Nutch with ant; the consolidated
commands below show the whole flow. Detailed build instructions are at
https://wiki.apache.org/nutch/HowToContribute
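Putting it all together, it looks roughly like this (the download URL and
the file/directory names below are just examples for a 1.6 source release;
use the version you actually downloaded):

    wget http://archive.apache.org/dist/nutch/1.6/apache-nutch-1.6-src.tar.gz
    tar xzf apache-nutch-1.6-src.tar.gz
    cd apache-nutch-1.6
    # the patch file sits next to, not inside, the source directory
    patch -p0 < ../NUTCH-1100-1.6-1.patch
    ant clean runtime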
That's all
Talat
On 11-11-2013 15:02, [email protected] wrote:
No, I am not truncating the page. The url-validator plugin checks that the
URLs crawled by Nutch are valid, so it will probably not fix my problem,
though I have added it to my list of plugins because you said it's good to
have.
I think the problem here is that my SOLR index contains data from both
Nutch and outside Nutch. The data indexed outside Nutch does not have a
digest field, so while executing SolrDeleteDuplicates, Nutch throws an
exception when it cannot find the digest field. Nutch needs to skip the
document if there is no digest field.
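For example, a query like this (assuming a default local Solr URL; adjust
it for your setup) should count the documents that have no digest field:

    curl -G "http://localhost:8983/solr/select" \
      --data-urlencode "q=*:* -digest:[* TO *]" \
      --data-urlencode "rows=0"

If numFound in the response is greater than zero, those are the documents
SolrDeleteDuplicates trips over.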
I think that this patch will probably fix my problem. Am I correct? If
yes, how do I apply the patch, or is there a jar with the fix available
that I can download?
I can also add the digest field to the outside-Nutch data. What should the
value in that field be?
Thanks,
Madhvi
On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
Hi Madhvi,
After adding urlfilter-validator in nutch-site.xml, do you truncate your
webpage?
Talat
On 09-11-2013 00:35, [email protected] wrote:
Hi Talat,
I can re-create this exception. It starts happening as soon as I index
from outside Nutch. SolrDeleteDuplicates works fine as long as the whole
Solr index came from Nutch.
I haven't found out yet specifically which field might be causing it. But
looking at the issue below, it might be because of the digest field not
being there.
https://issues.apache.org/jira/browse/NUTCH-1100
Can it be some other field?
Also, there is a patch for the digest field. How should I apply it? Any
help will be great!
Madhvi
On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
You wrote it incorrectly. You should write it like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Write this in nutch-site.xml, and after that you should rebuild with ant
clean runtime.
Talat
[email protected] wrote:
Hi Talat,
No, I am not using the urlfilter-validator plugin. Here is my list of
plugins:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Do I just need to change this to:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Thank you so much,
Madhvi
On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
Hi Madhvi,
Can you tell me what the active plugins in your nutch-site.xml are? I am
not sure, but we have an issue similar to this: if your Solr returns null,
it would be because of this issue. Please check the data your Solr
returns. You can look at https://issues.apache.org/jira/browse/NUTCH-1100
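For example, something like this (the host, port, and path are only
guesses; adjust them to your setup) shows which fields each document
actually has:

    curl -G "http://localhost:8983/solr/select" \
      --data-urlencode "q=*:*" \
      --data-urlencode "fl=id,digest" \
      --data-urlencode "rows=5"

If some documents come back without a digest value, you are hitting this
issue.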
If yours is the same, you should use the urlfilter-validator plugin.
urlfilter-validator has lots of benefits; I wrote about them in
http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
Talat
[email protected] wrote:
I am going to start my own thread rather than being under javozzo's
thread :)!
Hi,
I am using Nutch 1.5.1 and Solr 3.6 and am having a problem with the
SolrDeleteDuplicates command. Looking at the Hadoop logs, I am getting
this error:
java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.set(Text.java:178)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Also, I had another question about updating Nutch to 1.6 or 1.7. I had
tried updating to a newer version of Nutch but got an exception while
deleting duplicates in SOLR. After a lot of research online, I found that
a field had changed: a few said the digest field, and others said that
the url field is no longer there. So here are my questions:
1: Is there a newer solr mapping file that needs to be used?
2: Can the SOLR index from 1.5.1 and an index from a newer version
co-exist, or do we need to re-index from one version of Nutch?
I will really appreciate any help with this.
Thanks in advance,
Madhvi
Madhvi Arora
AutomationDirect
The #1 Best Mid-Sized Company to work for in Atlanta<http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012