[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop
[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-506:

    Attachment: NUTCH-506.patch

New version. I missed ProtocolStatus and ParseStatus. This patch updates them in a backward-compatible way.

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>          Key: NUTCH-506
>          URL: https://issues.apache.org/jira/browse/NUTCH-506
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Doğacan Güney
>      Fix For: 1.0.0
>  Attachments: compress.patch, NUTCH-506.patch
>
> Some data structures within Nutch (such as Content, ParseText) handle their own compression. We should delegate all compression to Hadoop. Nutch should also respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
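For context, io.seqfile.compression.type is a standard Hadoop configuration key, not something introduced by this patch. Once Nutch stops compressing inside its Writables, the operator chooses the compression mode with a fragment like this (shown for hadoop-site.xml; the value shown is only an example):

```xml
<!-- Illustrative hadoop-site.xml fragment: request block compression and
     let SequenceFile, rather than Nutch's own Writables, do the compressing. -->
<property>
  <name>io.seqfile.compression.type</name>
  <!-- Valid values: NONE | RECORD | BLOCK -->
  <value>BLOCK</value>
</property>
```

BLOCK compresses batches of records together and generally yields the best ratio; RECORD compresses each value individually, which matches the per-record behavior suggested for ParseText above.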
[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop
[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-506:

    Attachment: compress.patch

This patch changes Content (Content is no longer a CompressedWritable) and ParseText (from VersionedWritable to Writable). These changes are backward compatible, so old segments can still be read after this patch. The patch also changes Content's public API very slightly: the Content.forceInflate method is removed because it is no longer needed.
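The backward-compatibility claim above rests on a common Writable-evolution pattern: write a version byte first, and have readFields() dispatch on it so streams produced by the old layout remain readable. The sketch below is only an illustration of that pattern, not the actual NUTCH-506 patch; the class name and the "v1" layout (int length followed by raw UTF-8 bytes) are hypothetical, and Hadoop's Writable interface is stood in for by plain java.io methods so the example is self-contained:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not the NUTCH-506 patch itself): evolving a
// Writable-style class while keeping old on-disk data readable.
class ParseTextSketch {
    private static final byte VERSION = 2;
    private String text = "";

    ParseTextSketch() {}
    ParseTextSketch(String text) { this.text = text; }

    String getText() { return text; }

    // Always writes the current (v2) layout: version byte, then the string.
    void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(text);
    }

    // Dispatches on the version byte, so streams written by older code
    // (here, an assumed v1 layout: int length + raw UTF-8 bytes) still load.
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        switch (version) {
        case 1:
            byte[] buf = new byte[in.readInt()];
            in.readFully(buf);
            text = new String(buf, StandardCharsets.UTF_8);
            break;
        case 2:
            text = in.readUTF();
            break;
        default:
            throw new IOException("unknown version: " + version);
        }
    }

    // Convenience round-trip helpers for the example.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            write(new DataOutputStream(bos));
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static ParseTextSketch fromBytes(byte[] data) {
        try {
            ParseTextSketch p = new ParseTextSketch();
            p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return p;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this shape in place, delegating compression to Hadoop's SequenceFile needs no cooperation from the Writable at all: the value class only serializes plain bytes, and the SequenceFile layer compresses them per RECORD or per BLOCK.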