Hi,

1. After running ArcSegmentCreator + updatedb + invertlinks + solrindex, the content field in Solr contains strangely truncated and mixed-up content (see the attached GIF). There are fragments of HTML tags (No. 2 in the GIF), and there is content where it should not be (No. 1 in the GIF). I assume the parsing is not working correctly. Or could it be a charset problem?

2. I'd like to exclude the header info (No. 3 in the GIF). Is there an easy way to do that?

My setup: Ubuntu Server 8.10, Tomcat 6 (UTF-8 connector), Solr 1.4 dev, and Heritrix 1.15.4-200903140303. The (small) website I crawled is http://www.andreas-bock.de

I crawled the site (writing only content type text/html) and converted the ARC file into segments using:

../bin/nutch org.apache.nutch.tools.arc.ArcSegmentCreator /[ARC-file] /[segment]

Afterwards I created the crawldb and linkdb using

../bin/nutch updatedb ...

and

../bin/nutch invertlinks ...

Then I ran solrindex to put everything into Solr.

Can somebody help? Thank you very much!
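
P.S. For reference, the sequence of steps described above can be sketched roughly as the dry-run script below. It only prints the commands and does not execute them. The argument lists for updatedb, invertlinks and solrindex are elided in this post, so the arguments shown here are assumptions based on the usual Nutch 1.x command usage, and every path as well as the Solr URL is a hypothetical placeholder, not a value from the real setup.

```shell
#!/bin/sh
# Dry-run sketch: prints the four Nutch commands instead of running them.
# All values below are hypothetical placeholders (assumptions).
NUTCH="../bin/nutch"
ARC_FILE="/path/to/file.arc"          # ARC file written by Heritrix
SEGMENTS="/path/to/segments"          # segments output directory
CRAWLDB="/path/to/crawldb"
LINKDB="/path/to/linkdb"
SOLR_URL="http://localhost:8080/solr"

pipeline_commands() {
    # 1. Convert the ARC file into a Nutch segment (fetch + parse data).
    echo "$NUTCH org.apache.nutch.tools.arc.ArcSegmentCreator $ARC_FILE $SEGMENTS"
    # 2. Build or update the crawldb from the segment(s).
    #    The literal * is expanded by the shell when the printed line is run.
    echo "$NUTCH updatedb $CRAWLDB $SEGMENTS/*"
    # 3. Build the linkdb (inverted links) from the segment(s).
    echo "$NUTCH invertlinks $LINKDB $SEGMENTS/*"
    # 4. Send the documents to Solr.
    echo "$NUTCH solrindex $SOLR_URL $CRAWLDB $LINKDB $SEGMENTS/*"
}

pipeline_commands
```

Printing the commands first makes it easy to check the order and the paths before anything touches the crawldb or the Solr index.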
