Lewis John McGibbney created NUTCH-3149:
-------------------------------------------

             Summary: Investigate Remote Shuffle Service Integration (Apache 
Uniffle / Celeborn) for Shuffle-Intensive Nutch Jobs
                 Key: NUTCH-3149
                 URL: https://issues.apache.org/jira/browse/NUTCH-3149
             Project: Nutch
          Issue Type: Improvement
          Components: shuffle
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney


Several core jobs are shuffle-intensive, meaning they spend significant time 
and resources moving intermediate data between map and reduce phases. Remote 
shuffle services (such as Apache Uniffle and Apache Celeborn) offload this work 
to dedicated nodes, potentially improving:
 * Job execution time
 * Fault tolerance
 * Resource utilization
 * Scalability for large crawls

h4. Shuffle-Intensive Nutch Jobs Identified
||Job||Shuffle Intensity||Key Characteristics||
|LinkRank|Extreme|Iterative (10+ iterations), invert + analyze per iteration|
|WebGraph|Very High|3 sequential jobs (OutlinkDb → InlinkDb → NodeDb)|
|LinkDb|High|Full link inversion across segments|
|CrawlDb|High|URL grouping from multiple segments|
|Generator|Medium-High|Score-based selection + host/domain partitioning|
|Indexer|Medium-High|Multi-source aggregation (CrawlDb, LinkDb, segments)|
|Deduplication|Medium|Content signature-based grouping|
|Fetcher|Low|Primarily map-only, minimal shuffle|
h4. Key Metrics for Measuring Shuffle Intensity

Shuffle intensity can be determined from Hadoop's built-in TaskCounter and 
FileSystemCounter groups:
 * MAP_OUTPUT_BYTES / HDFS_BYTES_READ → Shuffle Ratio

 * SPILLED_RECORDS / MAP_OUTPUT_RECORDS → Spill Ratio

 * FILE_BYTES_READ + FILE_BYTES_WRITTEN → Local disk I/O overhead

Jobs with Shuffle Ratio > 2.0x or Spill Ratio > 2.0x are strong candidates for 
remote shuffle optimization.

An initial step would be to implement a configurable (off/off switch) reporting 
capability to write shuffle reports to the Nutch log. This will enable Nutch 
admins to determine over time if a remote shuffle candidate would benefit their 
Nutch jobs.

A future ticket will establish an experiment (comparative analysis) for Nutch 
shuffle data between Hadoop native (no changes), Uniffle and Celeborn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to