Hi there Tomi,

On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:

> I'm attempting to crawl a single samba mounted share. During testing,
> I'm crawling like this:
> 
> ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
> 
> I'm using luke 0.6 to query and analyze the index.
> 
> PROBLEMS
> 
> 1.) search by file type doesn't work
> I expected that a search "file type:pdf" would have returned a list of
> files on the local filesystem, but it does not.

I believe the keyword is "type", so your query should be "type:pdf" (without
the quotes). I'm not positive, but you may instead have to give the fully
qualified mimeType, as in "type:application/pdf", so it's worth trying both
forms.

Additionally, in order for the mimeTypes to be indexed properly, you need to
have the index-more plugin enabled. Check your
$NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes"
and make sure that the index-more plugin is enabled there.
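For reference, the property lives in nutch-site.xml and looks something like
the following (the plugin list below is only an illustration; keep whatever
other plugins you already have enabled):
---------
<property>
  <name>plugin.includes</name>
  <value>protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|more)|query-(basic|site|url|more)</value>
</property>
---------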

> 
> 2.) invalid nutch file type detection
> I see the following in the hadoop.log:
> -----------
> 2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/acta.zip of type application/zip
> 2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> 1024000 bytes. Parser can't handle incomplete pdf file.
> -----------
> acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

This may result from the contentType returned for "acta.zip". Check the web
server the file is hosted on, and see what contentType it reports for that
file.

Additionally, you may want to check whether magic resolution is enabled for
mimeTypes. This lets the mimeType be detected by comparing known magic-byte
signatures against the beginning of each file, rather than relying on the
file extension or the reported contentType.
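If I remember correctly, the relevant property is mime.type.magic (check
your nutch-default.xml to confirm the exact name), and you'd enable it like
so:
---------
<property>
  <name>mime.type.magic</name>
  <value>true</value>
</property>
---------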

> 
> 3.) Why is the TextParser mapped to application/pdf and what has that
> have to do with indexing a .txt file?
> ---------
> 2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
> file:/mnt/bobdocs/popis-vg-procisceni.txt
> 2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> contentType application/pdf via parse-plugins.xml, but its plugin.xml
> file does not claim to support contentType: application/pdf
> ---------

The TextParser * was * enabled as a last-resort means of extracting * some *
content from a PDF file, i.e., when the parse-pdf plugin wasn't enabled or
failed for some reason. Since parse-text is the 2nd option for parsing PDF
files, there was most likely some error in the original PDF parser. The way
the ParserFactory works now is that it iterates through a preference list of
parsers (specified in $NUTCH_HOME/conf/parse-plugins.xml) and tries each one
on the underlying content. The first successful parse is returned to the
Fetcher.
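To illustrate, the application/pdf entry in parse-plugins.xml looks roughly
like this (order in the list determines preference, with parse-pdf tried
first):
---------
<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
  <plugin id="parse-text" />
</mimeType>
---------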

> 
> 4.) Some .doc files can't be indexed, although I can open them via
> openoffice 2 with no problems
> ---------
> 2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/cards2005.doc of type
> application/msword
> 2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> in
> dex out of range: -1024
> ---------

What version of MS Word were you trying to index? I believe that the POI
library used by the word parser can only handle certain versions of MS Word
documents, although I'm not positive about this.


As for 5 and 6, I'm not entirely sure about those problems. I wish you luck
in solving them both, though, and hope what I said above helps you out.

Thanks!

Cheers,
  Chris

> 
> 5.) MoreIndexingFilter doesn't seem to work
> The relevant part of the hadoop.log file:
> ---------
> 2006-08-30 15:13:40,235 WARN  more.MoreIndexingFilter -
> file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException:
> The type can not be null or empty
> ---------
> This happens with other file types, as well:
> ---------
> 2006-08-30 15:13:54,697 WARN  more.MoreIndexingFilter -
> file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException:
> The type can not be null or empty
> ---------
> 
> 6.) At the moment, I'm crawling the same directory (/mnt/bobdocs), the
> crawl process seems to be stuck in an infinite loop and I have no way
> of knowing what's going on as the .log isn't flushed until the process
> finishes.
> 
> 
> ENVIRONMENT
> 
> logs/hadoop.log inspection reveals things like this:
> 
> My (relevant) crawl settings are:
> 
> ---------
>   <name>db.max.anchor.length</name>
>   <value>511</value>
> 
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
> 
>   <name>fetcher.server.delay</name>
>   <value>0</value>
> 
>   <name>fetcher.threads.fetch</name>
>   <value>5</value>
> 
>   <name>fetcher.verbose</name>
>   <value>true</value>
> 
>   <name>file.content.limit</name>
>   <value>102400000</value>
> 
>   <name>parser.character.encoding.default</name>
>   <value>iso8859-2</value>
> 
>   <name>indexer.max.title.length</name>
>   <value>511</value>
> 
>   <name>indexer.mergeFactor</name>
>   <value>5</value>
> 
>   <name>indexer.minMergeDocs</name>
>   <value>5</value>
> 
>   <name>plugin.includes</name>
> <value>nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|
> html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|s
> ite|url|more)|summary-basic|scoring-opic</value>
> 
>   <name>searcher.max.hits</name>
>   <value>100</value>
> ---------
> 
> 
> MISC. SUGGESTIONS
> 
> Add the following configuration options to the nutch-*.xml files:
> * allow search by date or extension (with no other criteria)
> * always flush log to disk (at every log addition).
> 
> TIA,
> t.n.a.

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
