[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ]
Andrzej Bialecki commented on NUTCH-392:
-----------------------------------------

Good point. We can change it to use the following pattern (as Hadoop uses internally), e.g.:

    contentOut = new MapFile.Writer(job, fs, content.toString(), Text.class, Content.class,
        SequenceFile.getCompressionType(job), progress);

However, the original patch had some merit, too. Some types of data are not very compressible in themselves (using RECORD compression), i.e. it takes more effort to compress/decompress them than the space savings are worth. In the case of crawl_parse and crawl_fetch it would make sense to enforce the BLOCK or NONE compression type and disallow RECORD. I know that BLOCK compression gives better space savings and, incidentally, may increase the writing speed. But I'm not sure what the performance impact of BLOCK-compressed MapFiles is when doing random reads - this is the access pattern in LinkDbInlinks, FetchedSegments and similar places. Could you perhaps test it? The original patch used RECORD compression for MapFiles, probably for this reason.

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-392.patch
>
>
> OutputFormat implementations should pass the Progressable they are passed to
> underlying SequenceFile implementations. This will keep reduce tasks from
> timing out when block writes are slow. This issue depends on
> http://issues.apache.org/jira/browse/HADOOP-636.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
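
[Editor's note: the following is a minimal sketch, not the actual NUTCH-392 patch, showing the pattern Andrzej quotes above: an OutputFormat that forwards the reduce task's Progressable into MapFile.Writer, and that falls back from RECORD to BLOCK compression as suggested for crawl_fetch/crawl_parse. It assumes the Hadoop 0.19-era org.apache.hadoop.mapred API; the class name ExampleSegmentOutputFormat and the use of Text values are illustrative only.]

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.OutputFormat;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Progressable;

    // Illustrative only - not the class added by NUTCH-392.patch.
    public class ExampleSegmentOutputFormat implements OutputFormat<Text, Text> {

      public RecordWriter<Text, Text> getRecordWriter(FileSystem fs, JobConf job,
          String name, Progressable progress) throws IOException {

        Path dir = new Path(FileOutputFormat.getOutputPath(job), name);

        // Respect the configured compression type, but disallow RECORD:
        // per-record compression of this data costs more CPU than it saves.
        SequenceFile.CompressionType cType = SequenceFile.getCompressionType(job);
        if (cType == SequenceFile.CompressionType.RECORD) {
          cType = SequenceFile.CompressionType.BLOCK;
        }

        // Forward the Progressable so that slow block writes keep reporting
        // progress and the reduce task is not killed for inactivity.
        final MapFile.Writer writer = new MapFile.Writer(job, fs, dir.toString(),
            Text.class, Text.class, cType, progress);

        return new RecordWriter<Text, Text>() {
          public void write(Text key, Text value) throws IOException {
            writer.append(key, value);
          }
          public void close(Reporter reporter) throws IOException {
            writer.close();
          }
        };
      }

      public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
        // nothing to validate in this sketch
      }
    }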