Re: Odd results and broken docs when indexing converted ARC-files.

Dennis Kubes Fri, 17 Apr 2009 21:46:09 -0700


Ken Krugler wrote:

Hi Felix,
1. After ArcSegmentCreator+updatedb+invertlinks+solrindex, I get
strangely truncated and mixed-up content in the content field of Solr; see the
attached gif.

There are fragments of HTML tags (see No. 2 in the attached gif) and
there is content where it should not be (see No. 1 in the attached
gif).
I assume the parsing does not work correctly. Or is there a problem
with charsets?


2. I'd like to exclude the header-info (see No. 3 in the attached gif).
Is there an easy way?
[snip]
I was recently working on some arc file import code, and took a quick look at what's in Nutch. From what I can tell, it assumes that the arc file is in gzip'd format (each record separately), so if that's not your format then you could be running into problems.
If the record processing goes wrong, the headers can end up dumped into the content.

It does. The arc handling code in Nutch is very basic and was originally for processing arc files from grub. It assumes a file where each record is gzipped and then appended to each other to create a complete file.
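
To illustrate the layout described above, here is a minimal Python sketch that walks a file made of separately gzipped records appended one after another. It is not the Nutch code; the function names and the sample records are made up for illustration, and it only assumes the standard gzip framing (magic bytes 0x1f 0x8b at the start of each member). A file that fails the looks_gzipped check is probably not in the per-record gzip layout.

```python
import gzip
import zlib


def looks_gzipped(data: bytes) -> bool:
    """Check for the two-byte gzip magic number at the start of the data."""
    return data[:2] == b"\x1f\x8b"


def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(data)
        payload += decomp.flush()
        yield payload
        # Bytes left over after this member belong to the next one.
        data = decomp.unused_data


# Hypothetical stand-in for a per-record-gzipped file: two made-up records,
# each compressed on its own and then concatenated.
records = [b"record one\n", b"record two\n"]
sample = b"".join(gzip.compress(r) for r in records)
members = list(iter_gzip_members(sample))
```

Reading the first two bytes of the real ARC file and passing them to looks_gzipped is a quick way to see whether it matches this assumption before running the conversion.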


Dennis


-- Ken

I use Ubuntu Server 8.10, Tomcat 6 (UTF8-Connector), Solr 1.4 dev and
Heritrix 1.15.4-200903140303.

The (little) website crawled is http://www.andreas-bock.de

I crawled the site (only writing ContentType text/html) and converted
the ARC-file into segments using:

../bin/nutch
org.apache.nutch.tools.arc.ArcSegmentCreator /[ARC-file] /[segment]

afterwards I created crawldb and linkdb using

../bin/nutch crawldb ... and
../bin/nutch invertlinks ...

then I used solrindex to put everything into Solr.


Can somebody help?

Thank you very much!
