[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508861 ]
Doğacan Güney commented on NUTCH-392:
-------------------------------------
After changing ParseText to not do any internal compression, the segment
directory looks like this:
828M crawl/segments/20070626163143/content
35M crawl/segments/20070626163143/crawl_fetch
23M crawl/segments/20070626163143/crawl_generate
44M crawl/segments/20070626163143/crawl_parse # BLOCK compression
218M crawl/segments/20070626163143/parse_data
524M crawl/segments/20070626163143/parse_text
192M crawl/segments/20070626163143/parse_text_block
242M crawl/segments/20070626163143/parse_text_record
As you can see, parse_text_block is around 20% smaller than
parse_text_record.
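For anyone who wants to reproduce the size comparison: the differently
compressed copies above were made by rewriting the existing data with a given
compression type. Below is a minimal sketch of that, assuming a <Text,
ParseText> SequenceFile as input (for a MapFile part that would be its 'data'
file); the paths, class name and the exact SequenceFile.createWriter overload
are illustrative and should be checked against the Hadoop version in use.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.parse.ParseText;

  public class RewriteParseText {
    // Copy an existing <Text, ParseText> SequenceFile into a new one written
    // with the requested compression type (NONE, RECORD or BLOCK).
    public static void rewrite(Configuration conf, Path in, Path out,
                               CompressionType type) throws Exception {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, Text.class, ParseText.class, type);
      Text key = new Text();
      ParseText value = new ParseText();
      while (reader.next(key, value)) {
        writer.append(key, value);
      }
      writer.close();
      reader.close();
    }
  }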
I also wrote a simple benchmark that randomly requests n urls from each parse
text sequentially (but it requests the same urls in the same order from all
parse texts). All parse texts contain a single part with ~250K urls. A rough
sketch of the lookup loop follows the results below. Here are the results
(Trial 0 is NONE, Trial 1 is RECORD, Trial 2 is BLOCK):
                    n = 1000    n = 5000    n = 10000
  Trial 0 (NONE)     9947 ms    40918 ms     57622 ms
  Trial 1 (RECORD)   6794 ms    19969 ms     24291 ms
  Trial 2 (BLOCK)    9717 ms    52622 ms     96292 ms
Overall, RECORD compression is the fastest and BLOCK compression is the
slowest (by a large margin).
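The attached ParseTextBenchmark.java is the code these numbers come from;
purely as an illustration of the access pattern being timed (random keyed
lookups against a single parse_text part), the loop looks roughly like this.
The part path and URL selection are placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.parse.ParseText;

  public class RandomReadLoop {
    // Look up the same pre-selected URLs, in the same order, against one
    // parse_text part and return the elapsed wall-clock time in ms.
    public static long time(Configuration conf, String partDir, String[] urls)
        throws Exception {
      FileSystem fs = FileSystem.get(conf);
      MapFile.Reader reader = new MapFile.Reader(fs, partDir, conf);
      ParseText value = new ParseText();
      long start = System.currentTimeMillis();
      for (int i = 0; i < urls.length; i++) {
        reader.get(new Text(urls[i]), value);  // random keyed lookup
      }
      reader.close();
      return System.currentTimeMillis() - start;
    }
  }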
Assuming my benchmark code is correct (feel free to show me where it is wrong),
these are my conclusions:
* I don't know what others think, but to me it still looks like we can use
BLOCK compression for structures like content, linkdb, etc. Even though it is
much slower than RECORD, it can still serve ~100 parse texts per second. While
that is certainly not good enough for parse text, it probably is good enough
for the others.
* We should definitely enable RECORD compression for parse text and BLOCK
compression for crawl_* (a configuration sketch follows this list). For some
reason, RECORD compression for parse text scales better than O(n), which makes
me think that something is wrong with my benchmark code.
* Nutch should not do any compression internally; Hadoop can do this better
with its native compression. Content and ParseText currently compress their
data on their own (and they can be converted to Hadoop's compression in a
backward-compatible way). I don't know if anything else does compression.
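To make the second point concrete, here is a rough sketch of what choosing the
compression types could look like where the writers are created (e.g. in an
OutputFormat implementation). The class and method here are made up for
illustration; the MapFile.Writer and SequenceFile.createWriter overloads
taking a CompressionType should be checked against the Hadoop version in use.

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.parse.ParseText;

  public class CompressedWriters {
    // RECORD compression for the parse_text MapFile, BLOCK compression for
    // the crawl_parse SequenceFile, per the conclusions above.
    public static void open(FileSystem fs, JobConf job, Path parseText,
                            Path crawlParse) throws IOException {
      MapFile.Writer textOut = new MapFile.Writer(job, fs,
          parseText.toString(), Text.class, ParseText.class,
          CompressionType.RECORD);
      SequenceFile.Writer crawlOut = SequenceFile.createWriter(fs, job,
          crawlParse, Text.class, CrawlDatum.class, CompressionType.BLOCK);
      // ... append records here, then:
      textOut.close();
      crawlOut.close();
    }
  }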
PS: The native Hadoop library is loaded. I haven't specified which compression
codec to use, so I guess it uses zlib. LZO results would probably have been
better.
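If someone wants to try LZO, the codec can be passed explicitly when the
writer is created. A minimal sketch, assuming an LZO codec class is present in
the Hadoop build being used (otherwise DefaultCodec/zlib is what you get):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.LzoCodec;
  import org.apache.hadoop.util.ReflectionUtils;
  import org.apache.nutch.parse.ParseText;

  public class LzoParseTextWriter {
    // Same RECORD-compressed writer as before, but with an explicit LZO
    // codec instead of the default zlib codec.
    public static SequenceFile.Writer open(Configuration conf, Path out)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      CompressionCodec lzo =
          (CompressionCodec) ReflectionUtils.newInstance(LzoCodec.class, conf);
      return SequenceFile.createWriter(fs, conf, out, Text.class,
          ParseText.class, CompressionType.RECORD, lzo);
    }
  }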
> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
> Key: NUTCH-392
> URL: https://issues.apache.org/jira/browse/NUTCH-392
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Doug Cutting
> Assignee: Andrzej Bialecki
> Fix For: 1.0.0
>
> Attachments: NUTCH-392.patch, ParseTextBenchmark.java
>
>
> OutputFormat implementations should pass the Progressable they are passed to
> underlying SequenceFile implementations. This will keep reduce tasks from
> timing out when block writes are slow. This issue depends on
> http://issues.apache.org/jira/browse/HADOOP-636.
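As context for the issue itself, the shape of the change is roughly this (the
real patch is attached as NUTCH-392.patch; the sketch below is only an
illustration): the old mapred OutputFormat.getRecordWriter() receives a
Progressable from the framework, and the MapFile/SequenceFile writer overloads
added by HADOOP-636 accept one, so the output format only needs to hand it
through. The exact constructor overload should be checked against the Hadoop
version in use.

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.util.Progressable;
  import org.apache.nutch.parse.ParseText;

  public class ProgressAwareWriter {
    // 'progress' is the Progressable handed to OutputFormat.getRecordWriter();
    // passing it down lets the task report progress while a slow block write
    // is in flight, so the reduce task is not killed for being unresponsive.
    public static MapFile.Writer openParseText(FileSystem fs, JobConf job,
                                               Path dir, Progressable progress)
        throws IOException {
      return new MapFile.Writer(job, fs, dir.toString(), Text.class,
          ParseText.class, CompressionType.RECORD, progress);
    }
  }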