[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508900 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

Excellent work, Doğacan - thank you. The numbers for RECORD compression probably depend on some sweet spot in the environment, related to CPU usage, how the OS pulls data from the disk / disk buffers, the size of the hard drive cache, the size of the internal memory buffers in Hadoop, etc. I would venture a guess that compression NONE is raw disk I/O bound, whereas BLOCK compression suffers from the poor performance of seeking in compressed data.

I agree with your conclusions regarding the type of compression to use for each segment part.

Re: Nutch not doing any internal compression for Content and ParseText: Content is a versioned Writable, so we can change its implementation and provide compatibility code to read older data. The same goes for ParseText.

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>         Attachments: NUTCH-392.patch, ParseTextBenchmark.java
>
> OutputFormat implementations should pass the Progressable they are passed to
> underlying SequenceFile implementations. This will keep reduce tasks from
> timing out when block writes are slow. This issue depends on
> http://issues.apache.org/jira/browse/HADOOP-636.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508861 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

After changing ParseText to not do any internal compression, the segment directory looks like this:

828M  crawl/segments/20070626163143/content
 35M  crawl/segments/20070626163143/crawl_fetch
 23M  crawl/segments/20070626163143/crawl_generate
 44M  crawl/segments/20070626163143/crawl_parse        # BLOCK compression
218M  crawl/segments/20070626163143/parse_data
524M  crawl/segments/20070626163143/parse_text
192M  crawl/segments/20070626163143/parse_text_block
242M  crawl/segments/20070626163143/parse_text_record

As you can see, parse_text_block is around 20 percent smaller than parse_text_record.

I also wrote a simple benchmark that randomly requests n urls from each parse text sequentially (but it requests the same urls in the same order from all parse texts). All parse texts contain a single part with ~250K urls. Here are the results (Trial 0 is NONE, Trial 1 is RECORD, Trial 2 is BLOCK):

for n = 1000:
Trial 0 has taken 9947 ms.
Trial 1 has taken 6794 ms.
Trial 2 has taken 9717 ms.

for n = 5000:
Trial 0 has taken 40918 ms.
Trial 1 has taken 19969 ms.
Trial 2 has taken 52622 ms.

for n = 1
Trial 0 has taken 57622 ms.
Trial 1 has taken 24291 ms.
Trial 2 has taken 96292 ms.

Overall, RECORD compression is the fastest and BLOCK compression is the slowest (by a large margin). Assuming my benchmark code is correct (feel free to show me where it is wrong), these are my conclusions:

* I don't know what others think, but to me it still looks like we can use BLOCK compression for structures like content, linkdb, etc. Even though it is much slower than RECORD, it can still serve ~100 parse texts per second. While this is certainly not good enough for parse text, it probably is good enough for the others.

* We should definitely enable RECORD compression for parse text and BLOCK compression for crawl_*. For some reason, RECORD compression on parse text performs better than O(n), which makes me think that something is wrong with my benchmark code.

* Nutch should not do any compression internally. Hadoop can do this better with its native compression. Content and ParseText compress their data on their own (and they can be converted to Hadoop's compression in a backward-compatible way). I don't know if anything else does compression.

PS: The native Hadoop library is loaded. I haven't specified which compression codec to use, so I guess it uses zlib. Lzo results would probably have been better.
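The attached ParseTextBenchmark.java is not reproduced in this thread, but the protocol described above (pick n random keys once, then fetch them in the same order from each store, timing each trial) can be sketched roughly as follows. This is a hypothetical stand-in: the Map-backed lookups here take the place of MapFile.Reader.get() calls over the three differently compressed parse_text directories, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

public class RandomAccessBench {

    /** Time fetching the same keys, in the same order, from each store. */
    static long[] run(List<Function<Integer, String>> stores, int[] keys) {
        long[] millis = new long[stores.size()];
        for (int t = 0; t < stores.size(); t++) {
            long start = System.nanoTime();
            for (int k : keys) {
                stores.get(t).apply(k);  // real benchmark: MapFile.Reader.get()
            }
            millis[t] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Stand-in data; each "store" would really be one parse_text
        // directory written with NONE, RECORD, or BLOCK compression.
        Map<Integer, String> data = new HashMap<>();
        for (int i = 0; i < 250_000; i++) data.put(i, "parse text " + i);
        List<Function<Integer, String>> stores =
            List.of(data::get, data::get, data::get);

        // Fixed seed so every trial sees the same keys in the same order.
        int[] keys = new Random(42).ints(1000, 0, 250_000).toArray();
        long[] times = run(stores, keys);
        for (int t = 0; t < times.length; t++)
            System.out.println("Trial " + t + " has taken " + times[t] + " ms.");
    }
}
```

Reusing one fixed key sequence across trials is what makes the trial times comparable; only the store under test varies.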
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508823 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

> data of parse_text is already compressed so recompressing it does not give
> huge gains

Wow, I am certainly not at my sharpest today. Thanks for pointing it out. I will change ParseText and report back with the results.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508820 ]

Sami Siren commented on NUTCH-392:
----------------------------------

> But why is parse_text_block's size so close to parse_text

The data of parse_text is already compressed, so recompressing it does not give huge gains.
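Sami's point can be illustrated with a quick sketch using plain java.util.zip (standing in for Hadoop's zlib codec; the class and method names here are hypothetical): deflating page text once gives large savings, but deflating the already-deflated bytes again gains essentially nothing, which is why a BLOCK-compressed parse_text barely shrinks while ParseText still compresses internally.

```java
import java.util.zip.Deflater;

public class Recompress {

    /** Deflate the input with zlib defaults and return the compressed bytes. */
    static byte[] deflate(byte[] in) {
        Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        // Buffer large enough for worst-case expansion on inputs this size.
        byte[] buf = new byte[in.length * 2 + 64];
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static void main(String[] args) {
        // Redundant "page text": compresses very well the first time.
        byte[] text = "the quick brown fox ".repeat(500).getBytes();
        byte[] once = deflate(text);
        byte[] twice = deflate(once);  // compressed data is high-entropy
        System.out.println("original:       " + text.length + " bytes");
        System.out.println("deflated once:  " + once.length + " bytes");
        System.out.println("deflated twice: " + twice.length + " bytes");
        // The second pass gains almost nothing, and can even grow the data.
    }
}
```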
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508818 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

> Re: Content versioning - we can use negative int values as version numbers.
> I'm still not sure what is the impact of BLOCK compression on MapFile random
> access.

Good idea!

(Btw, I still believe that BLOCK compression's performance hit is irrelevant for anything but parse_text. That's why I am trying to do the second test. I was trying to test how fast random access on parse_text is under different compressions. BLOCK compression will probably not be fast enough for parse_text. But if the impact is minor, it can be used for everything else.)

> Regarding the sizes: parse_text_record size is larger, because for small
> chunks of data the compression overhead may far outweigh the compression
> gains. Re: the large size of crawl_parse - is this related to your patch? It
> could be simply related to the fact that there are many outlinks in those
> pages ... Or is crawl_parse using BLOCK compression too?

OK, I understand why parse_text_record is larger, thanks for the explanation. But why is parse_text_block's size so close to parse_text? (Why is content so different from parse_text? BLOCK works wonders on content but does not even give a 10% reduction on parse_text.) The feed plugin wasn't enabled, so my patch shouldn't matter. Also, crawl_parse is using NONE compression.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508816 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

Re: Content versioning - we can use negative int values as version numbers. I'm still not sure what the impact of BLOCK compression on MapFile random access is.

Regarding the sizes: the parse_text_record size is larger because, for small chunks of data, the compression overhead may far outweigh the compression gains.

Re: the large size of crawl_parse - is this related to your patch? It could simply be related to the fact that there are many outlinks in those pages ... Or is crawl_parse using BLOCK compression too?
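The small-chunk overhead described above is easy to demonstrate with plain java.util.zip (standing in for Hadoop's per-record codec; the class and value names are hypothetical): per-record compression of one short value can actually grow it, because the fixed header and checksum overhead exceeds any savings, while one large block of many records compresses well.

```java
import java.util.zip.Deflater;

public class SmallRecordOverhead {

    /** Return the zlib-compressed size of the input, in bytes. */
    static int deflatedSize(byte[] in) {
        Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        // Room for the zlib header/trailer plus worst-case expansion.
        byte[] buf = new byte[in.length + 128];
        int n = d.deflate(buf);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        // One short record, as RECORD compression would see it.
        byte[] small = "http://example.com/page?id=4711".getBytes();
        // Many records concatenated, as BLOCK compression would see them.
        byte[] large = "some anchor text for a link ".repeat(200).getBytes();
        System.out.println("small record: " + small.length + " -> "
                + deflatedSize(small) + " bytes (grew!)");
        System.out.println("large block:  " + large.length + " -> "
                + deflatedSize(large) + " bytes");
    }
}
```

The short record carries the full zlib framing cost on its own, while the large block amortizes it and lets the codec exploit redundancy across records.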
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508812 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

OK, I have done a bit of testing on compression but I'm stuck. Here it is:

* I changed Content to be a regular Writable instead of a CompressedWritable and turned on BLOCK compression. The results were pretty impressive: Content size went down from ~1GB to ~500MB. Unfortunately, I haven't figured out how we can change Content in a backward-compatible way. Reading the first byte as a version won't work (because the first byte is not the version; the first thing written is the size of the compressed data as an int).

* This is where it gets strange. I was trying to test the performance impact of BLOCK compression (when generating summaries). I fetched a sample 250K url segment (a subset of dmoz). Then I made a small modification to ParseOutputFormat so that it outputs parse_text in all three compression formats ( http://www.ceng.metu.edu.tr/~e1345172/comp_parse.patch ). After parsing, the segment looks like this:

828M  crawl/segments/20070626163143/content
 35M  crawl/segments/20070626163143/crawl_fetch
 23M  crawl/segments/20070626163143/crawl_generate
345M  crawl/segments/20070626163143/crawl_parse
196M  crawl/segments/20070626163143/parse_data
244M  crawl/segments/20070626163143/parse_text         # NONE
232M  crawl/segments/20070626163143/parse_text_block   # BLOCK
246M  crawl/segments/20070626163143/parse_text_record  # RECORD

Not only is parse_text_record larger than parse_text, and parse_text_block only slightly smaller, but crawl_parse is larger than any of them! I probably messed up somewhere and I can't see it. Any help would be welcome.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500951 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

I don't think it's a good idea; it creates too many cryptic options. Average users won't be able to assess what the best choices are there, and advanced users are able to change this directly in the source anyway ...
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500935 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

Perhaps we can allow a user to configure this on a per-structure basis by adding new properties: compression.type.{parse_text,crawldb,parse_data,linkdb} or whatever. Then we can make such a property take one of four valid values - BLOCK, NONE, RECORD, or DEFAULT - where DEFAULT is the value of io.sequence.file.compression.
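A minimal sketch of the lookup logic such per-structure properties would imply, with a plain Map standing in for a Hadoop Configuration. The property names are the ones proposed above; the class, method, and enum here are hypothetical and only illustrate the DEFAULT fallback.

```java
import java.util.Map;

public class CompressionTypeLookup {

    // Mirrors the three SequenceFile compression types; DEFAULT is handled
    // as a fallback string rather than an enum constant.
    enum CompressionType { NONE, RECORD, BLOCK }

    /**
     * Resolve the compression type for one segment part, e.g. "parse_text".
     * "DEFAULT" (or an unset per-part property) falls back to the global
     * io.sequence.file.compression setting, and finally to NONE.
     */
    static CompressionType resolve(Map<String, String> conf, String part) {
        String val = conf.getOrDefault("compression.type." + part, "DEFAULT");
        if (val.equals("DEFAULT")) {
            val = conf.getOrDefault("io.sequence.file.compression", "NONE");
        }
        return CompressionType.valueOf(val);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "compression.type.parse_text", "RECORD",
            "io.sequence.file.compression", "BLOCK");
        // parse_text is pinned to RECORD; linkdb falls through to the global.
        System.out.println("parse_text -> " + resolve(conf, "parse_text"));
        System.out.println("linkdb     -> " + resolve(conf, "linkdb"));
    }
}
```

The real implementation would read these from the job Configuration and pass the result to the MapFile.Writer constructor instead of the hard-coded types.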
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ]

Doug Cutting commented on NUTCH-392:
------------------------------------

Anchors, explain, and the cache are used relatively infrequently, considerably less than once per query, and hence *much* less than once per displayed hit. So it might be acceptable if they're somewhat slower. Block compression should still be fast enough for interactive use, and these uses would never dominate CPU use in an application, would they?
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

> I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*,
> content, parse_data. Because I don't think that people will need fast
> random-access on anything but parse_text.

LinkDb is accessed on-line randomly through LinkDbInlinks, when users request anchors. Similarly, parse_data is accessed when requesting "explain", and may also be accessed to retrieve other hit metadata. Content is accessed randomly when displaying the cached preview. I think in all these cases we can use at most RECORD compression, or NONE.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500665 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*, content, and parse_data, because I don't think people will need fast random access on anything but parse_text. I agree that we need to test the performance impact of BLOCK compression before committing such a change. Unfortunately, our setup doesn't include BLOCK compression right now. I will try to test it and report some results once I get the chance.

PS: Compressing content will not yield significant savings right now, since it is already compressed internally, but once Content stops doing that I think there will be _huge_ savings there.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

Good point. We can change it to use the following pattern (as Hadoop uses internally), e.g.:

  contentOut = new MapFile.Writer(job, fs, content.toString(),
      Text.class, Content.class,
      SequenceFile.getCompressionType(job), progress);

However, the original patch had some merits, too. Some types of data are not that compressible in themselves (using RECORD compression), i.e. it takes more effort to compress/decompress than the space savings are worth. In the case of crawl_parse and crawl_fetch it would make sense to enforce the BLOCK or NONE compression type, and disallow the RECORD type.

I know that BLOCK compression gives better space savings, and incidentally may increase the writing speed. But I'm not sure what the performance impact of using BLOCK-compressed MapFile-s is when doing random reading - this is the scenario in LinkDbInlinks, FetchedSegments and similar places. Could you perhaps test it? The original patch used RECORD compression for MapFile-s, probably for this reason.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500603 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

From what I understand of the MapFile.Writer code in Hadoop, if you give a CompressionType as an argument in its constructor, it overrides the compression value in the config. So since Nutch manually sets parse_text and parse_data to RECORD compression (and crawl_parse to NONE), we will not get the advantages of BLOCK compression even if we set it in the config.

BLOCK compression seems to work really great if you have the native libraries in place, so IMHO it would be better to not manually set the CompressionType and allow people to set it to whatever they want in the config.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ]

Doug Cutting commented on NUTCH-392:
------------------------------------

This should not be applied until Nutch uses Hadoop 0.8. It also contains a patch required to make Nutch work correctly with Hadoop 0.8 (where LocalFileSystem.rename() of a non-existing file now throws an exception).