Odd results and broken docs when indexing converted ARC-files.

Felix Zimmermann Fri, 17 Apr 2009 05:47:41 -0700

Hi,

1. After running ArcSegmentCreator, updatedb, invertlinks and solrindex,
the content field in Solr contains strangely truncated and mixed-up
content; see the attached GIF.


There are fragments of HTML tags (see No. 2 in the attached GIF) and
there is content where it should not be (see No. 1 in the attached
GIF).
I assume the parsing does not work correctly. Or is there a problem
with charsets?


2. I'd like to exclude the header info (see No. 3 in the attached GIF).
Is there an easy way to do that?


I use Ubuntu Server 8.10, Tomcat 6 (UTF-8 connector), Solr 1.4 dev and
Heritrix 1.15.4-200903140303.

The (small) website I crawled is http://www.andreas-bock.de

I crawled the site (writing only records with Content-Type text/html)
and converted the ARC file into segments using:

../bin/nutch org.apache.nutch.tools.arc.ArcSegmentCreator /[ARC-file] /[segment]

Afterwards I created the crawldb and linkdb using

../bin/nutch updatedb ... and
../bin/nutch invertlinks ...

Then I ran solrindex to push everything into Solr.
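
For reference, the steps above can be sketched as one dry-run script that only prints the four commands. This is a rough sketch, not the exact invocation used: every path and the Solr URL are hypothetical placeholders, and the argument order (including the `-dir` form) is how I recall the Nutch 1.0 usage messages, so it is worth checking by running each command without arguments first.

```shell
#!/bin/sh
# Dry-run sketch of the ARC -> segments -> crawldb -> linkdb -> Solr steps.
# It prints the commands instead of running them.
# ASSUMPTIONS: all paths and the Solr URL below are made-up placeholders;
# the argument order is recalled from the Nutch 1.0 usage messages.
NUTCH=${NUTCH:-../bin/nutch}
ARC=${ARC:-/path/to/crawl.arc.gz}          # hypothetical ARC file
SEGMENTS=${SEGMENTS:-/path/to/segments}    # hypothetical segments directory
CRAWLDB=${CRAWLDB:-/path/to/crawldb}       # hypothetical crawldb location
LINKDB=${LINKDB:-/path/to/linkdb}          # hypothetical linkdb location
SOLR_URL=${SOLR_URL:-http://localhost:8080/solr}  # hypothetical Solr URL

pipeline() {
  # 1. Convert the ARC file into a segment under $SEGMENTS.
  echo "$NUTCH org.apache.nutch.tools.arc.ArcSegmentCreator $ARC $SEGMENTS"
  # 2. Build or update the crawldb from the segments.
  echo "$NUTCH updatedb $CRAWLDB -dir $SEGMENTS"
  # 3. Build the linkdb from the segments.
  echo "$NUTCH invertlinks $LINKDB -dir $SEGMENTS"
  # 4. Send the documents to Solr.
  echo "$NUTCH solrindex $SOLR_URL $CRAWLDB $LINKDB -dir $SEGMENTS"
}

pipeline
```

Once the printed commands look right for the local setup, they can be run one at a time by hand, which also makes it easier to see at which step the content gets mangled.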


Can somebody help?

Thank you very much!
