[ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
J. Gobel updated NUTCH-1511:
----------------------------
    Description:
After applying the patch for the metadata parser (NUTCH-1478), I notice that the metadata field is populated with the correct information just before the crawl ends. However, once the crawl has completely finished, the metadata field is populated with 'garbage': _csh_�����

Last few lines of my logfile (p.s. I use: bin/nutch crawl urls -depth 1 -topN 5):

..
013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots index, follow
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : description Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null in cleanup

> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch

--
This message is automatically generated by JIRA.