[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035902#comment-15035902 ]

Chris A. Mattmann commented on NUTCH-2172:
------------------------------------------

bq. This could be an improvement if we assume that MIME types do not contain white space

This is not a safe assumption on the Internet. We see all the time in crawls that web servers return MIME types with white space.

> Parsing whitespace not just tabs in contenttype-mapping.txt
> -----------------------------------------------------------
>
>                 Key: NUTCH-2172
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2172
>             Project: Nutch
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.10
>         Environment: Macosx, Java 8
>            Reporter: Nicola Tonellotto
>            Priority: Minor
>              Labels: easyfix, newbie
>         Attachments: NUTCH-2172-1.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the
> magic.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
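The trade-off discussed in this thread can be illustrated with a small, self-contained sketch. This is not the Nutch readConfiguration() code: the class SeparatorSketch, the helper fields(), and the sample lines are hypothetical and exist only to show how String.split behaves with "\t" versus "\\s+". The last sample line assumes a mapping value that carries a parameter after a space ("text/html; charset=UTF-8"), which is one way a MIME type string can contain white space.

```java
import java.util.Arrays;
import java.util.List;

public class SeparatorSketch {

    // Hypothetical helper (not the Nutch implementation): split one
    // mapping line with the given separator regex and return the fields.
    public static List<String> fields(String line, String separatorRegex) {
        return Arrays.asList(line.split(separatorRegex));
    }

    public static void main(String[] args) {
        // Illustrative sample lines, not taken from contenttype-mapping.txt.
        String[] lines = {
            "text/html\tapplication/xhtml+xml",    // single tab
            "text/html application/xhtml+xml",     // single space
            "text/html\t\tapplication/xhtml+xml",  // two tabs
            "text/html\ttext/html; charset=UTF-8"  // value containing a space
        };

        for (String line : lines) {
            // Print the field lists produced by each separator side by side.
            System.out.println("\"\\t\"   -> " + fields(line, "\t"));
            System.out.println("\"\\\\s+\" -> " + fields(line, "\\s+"));
            System.out.println();
        }
    }
}
```

With "\t", the space-separated line stays a single field and the two-tab line yields an empty middle field. With "\\s+", both lines split into two clean fields, but the value containing a space is broken into two pieces, which is the risk raised in the comment above.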
[jira] [Commented] (NUTCH-2176) Clean up of log4j.properties
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035746#comment-15035746 ]

Hudson commented on NUTCH-2176:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3320 (See [https://builds.apache.org/job/Nutch-trunk/3320/])
NUTCH-2176 Clean up of log4j.properties (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1717622])
* trunk/CHANGES.txt
* trunk/conf/log4j.properties

> Clean up of log4j.properties
> ----------------------------
>
>                 Key: NUTCH-2176
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2176
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Trivial
>             Fix For: 1.11
>
>         Attachments: NUTCH-2176.patch
>
>
> Properties file:
> - missing DeduplicationJob
> - still has CrawldbScanner
> - still has reverted HostDB stuff
> - is not sorted
[jira] [Resolved] (NUTCH-2176) Clean up of log4j.properties
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2176.
----------------------------------
    Resolution: Fixed

Committed to trunk in rev. 1717622.
[jira] [Updated] (NUTCH-2176) Clean up of log4j.properties
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2176:
---------------------------------
    Summary: Clean up of log4j.properties  (was: log4j.properties is a mess)