Lewis John McGibbney created NUTCH-3149:
-------------------------------------------
Summary: Investigate Remote Shuffle Service Integration (Apache
Uniffle / Celeborn) for Shuffle-Intensive Nutch Jobs
Key: NUTCH-3149
URL: https://issues.apache.org/jira/browse/NUTCH-3149
Project: Nutch
Issue Type: Improvement
Components: shuffle
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Several core jobs are shuffle-intensive, meaning they spend significant time
and resources moving intermediate data between map and reduce phases. Remote
shuffle services (such as Apache Uniffle and Apache Celeborn) offload this work
to dedicated nodes, potentially improving:
* Job execution time
* Fault tolerance
* Resource utilization
* Scalability for large crawls
h4. Shuffle-Intensive Nutch Jobs Identified
||Job||Shuffle Intensity||Key Characteristics||
|LinkRank|Extreme|Iterative (10+ iterations), invert + analyze per iteration|
|WebGraph|Very High|3 sequential jobs (OutlinkDb → InlinkDb → NodeDb)|
|LinkDb|High|Full link inversion across segments|
|CrawlDb|High|URL grouping from multiple segments|
|Generator|Medium-High|Score-based selection + host/domain partitioning|
|Indexer|Medium-High|Multi-source aggregation (CrawlDb, LinkDb, segments)|
|Deduplication|Medium|Content signature-based grouping|
|Fetcher|Low|Primarily map-only, minimal shuffle|
h4. Key Metrics for Measuring Shuffle Intensity
Shuffle intensity can be determined from Hadoop's built-in TaskCounter and
FileSystemCounter groups:
* MAP_OUTPUT_BYTES / HDFS_BYTES_READ → Shuffle Ratio
* SPILLED_RECORDS / MAP_OUTPUT_RECORDS → Spill Ratio
* FILE_BYTES_READ + FILE_BYTES_WRITTEN → Local disk I/O overhead
Jobs with Shuffle Ratio > 2.0x or Spill Ratio > 2.0x are strong candidates for
remote shuffle optimization.
An initial step would be to implement a configurable (off/off switch) reporting
capability to write shuffle reports to the Nutch log. This will enable Nutch
admins to determine over time if a remote shuffle candidate would benefit their
Nutch jobs.
A future ticket will establish an experiment (comparative analysis) for Nutch
shuffle data between Hadoop native (no changes), Uniffle and Celeborn.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)