Ken Krugler wrote:
Hi Felix,

1. After ArcSegmentCreator + updatedb + invertlinks + solrindex, I get
strangely truncated and mixed-up content in the content field of Solr; see
the attached gif.

There are fragments of HTML tags (see No. 2 in the attached gif) and
there is content where it should not be (see No. 1 in the attached
gif).
I assume the parsing does not work correctly. Or is there a problem
with charsets?


2. I'd like to exclude the header-info (see No. 3 in the attached gif).
Is there an easy way?

[snip]

I was recently working on some ARC file import code, and took a quick look at what's in Nutch. From what I can tell, it assumes that the ARC file is gzipped, with each record compressed separately, so if that's not your format then you could be running into problems.

If a record is processed incorrectly, its headers can end up dumped into the content.

It does. The ARC handling code in Nutch is very basic and was originally written for processing ARC files from Grub. It assumes a file in which each record is gzipped individually and the gzipped records are then appended to one another to create a complete file.
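
One rough way to check whether a given ARC file has that layout is to count the gzip members it contains. The following is only a minimal Python sketch, not anything taken from Nutch: the function name count_gzip_members is made up for illustration, it reads the whole file into memory (fine for a small crawl), and it does not handle padding between members. A result of 0 means the file is not gzipped at all, 1 suggests the whole file was compressed as a single stream, and a larger number is consistent with one gzip member per record.

```python
import gzip
import sys
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def count_gzip_members(data: bytes) -> int:
    """Count the complete gzip members concatenated at the start of data.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    count = 0
    while data[:2] == GZIP_MAGIC:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        try:
            decomp.decompress(data)
        except zlib.error:
            break  # corrupt member: stop counting
        if not decomp.eof:
            break  # truncated member: stop counting
        count += 1
        # Bytes after the end of this member, i.e. the next member if any.
        data = decomp.unused_data
    return count


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Usage: python count_members.py /path/to/file.arc.gz
        with open(sys.argv[1], "rb") as handle:
            print(count_gzip_members(handle.read()))
    else:
        # Synthetic demo: two records gzipped separately, then appended.
        demo = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
        print(count_gzip_members(demo))
```

If the count is 0 or 1 for a file holding many records, the file probably does not match what the Nutch ARC code expects.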

Dennis


-- Ken


I use Ubuntu Server 8.10, Tomcat 6 (UTF8-Connector), Solr 1.4 dev and
Heritrix 1.15.4-200903140303.

The (little) website crawled is http://www.andreas-bock.de

I crawled the site (only writing ContentType text/html) and converted
the ARC file into segments using:

../bin/nutch
org.apache.nutch.tools.arc.ArcSegmentCreator /[ARC-file] /[segment]

Afterwards I created the crawldb and linkdb using

../bin/nutch updatedb ... and
../bin/nutch invertlinks ...

Then I used solrindex to put everything into Solr.


Can somebody help?

Thank you very much!

