OutlinkExtractor extremely slow on some non-plain text
------------------------------------------------------

         Key: NUTCH-150
         URL: http://issues.apache.org/jira/browse/NUTCH-150
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
 Environment: All
    Reporter: Paul Baclace
    Priority: Minor


While using MIME settings that aggressively parsed everything by default, 
rather than having conf/parse-plugins.xml associate parse-default with *, some 
parse tasks took an extremely long time to finish.  For instance, a single 
PostScript file took 9 hours to parse.  Stack traces showed the time being 
spent in OutlinkExtractor.getOutlinks(...), inside the regular expression 
match() call.

Analysis:  The regular expression matching in OutlinkExtractor.getOutlinks(...) 
hits pathological cases with extremely long runtimes when non-plain-text 
content is processed.

Workaround 1:  Avoid treating non-plain-text, especially postscript files, as 
text or html.
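
A rough sketch of how Workaround 1 could be enforced in code, by checking the 
leading bytes of the content before handing it to any text-oriented outlink 
extraction.  The class and method names (ContentSniffer, 
looksLikePostScriptOrPdf) are hypothetical and are not part of Nutch; the only 
facts relied on are that PostScript files conventionally begin with "%!" and 
PDF files with "%PDF-".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: a cheap magic-number check.
final class ContentSniffer {

    // True if 'content' begins with the ASCII bytes of 'magic'.
    static boolean startsWith(byte[] content, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (content == null || content.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (content[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    // PostScript conventionally starts with "%!", PDF with "%PDF-".
    static boolean looksLikePostScriptOrPdf(byte[] content) {
        return startsWith(content, "%!") || startsWith(content, "%PDF-");
    }
}
```

A caller would skip regex-based link extraction whenever 
looksLikePostScriptOrPdf(...) returns true.  This is only a heuristic and does 
not replace a correct MIME type to parser mapping.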

Workaround 2:  Send SIGQUIT (kill -SIGQUIT) to the child TaskRunner process.  
This interrupts the match() and the process continues.  It may need to be done 
multiple times.  (In theory, SIGQUIT is not supposed to do this, but in 
practice it does.)
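
As an in-process alternative to interrupting match() from outside, one 
possible approach is to run the regular expression over a CharSequence wrapper 
that gives up once a time budget is exhausted.  This is a minimal sketch of 
that general technique, not the fix adopted for this issue; BoundedRegex, 
DeadlineCharSequence, RegexTimeoutException and countMatches are hypothetical 
names, and the budget value is an assumption to be tuned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: bound the time spent in a regex match.
final class BoundedRegex {

    // Thrown from charAt() when the time budget has been used up.
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException() {
            super("regex match exceeded its time budget");
        }
    }

    // Delegates to the wrapped text but checks a deadline on every charAt().
    // The regex engine reads its input through charAt(), so a runaway
    // backtracking match is aborted shortly after the deadline passes.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override public int length() {
            return inner.length();
        }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() {
            return inner.toString();
        }
    }

    // Counts matches of 'p' in 'text'; returns -1 if the budget ran out.
    static int countMatches(Pattern p, String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher m = p.matcher(new DeadlineCharSequence(text, deadline));
        int n = 0;
        try {
            while (m.find()) {
                n++;
            }
        } catch (RegexTimeoutException e) {
            return -1;
        }
        return n;
    }
}
```

The deadline check adds a small cost to every character access, which is the 
trade-off for being able to abandon a pathological input instead of waiting 
hours for it.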


