Yifan Cai created CASSANALYTICS-89:
--------------------------------------
Summary: Create dedicated data class for broadcast variable during bulk write
Key: CASSANALYTICS-89
URL: https://issues.apache.org/jira/browse/CASSANALYTICS-89
Project: Apache Cassandra Analytics
Issue Type: Improvement
Components: Writer
Reporter: Yifan Cai
Bulk write in Analytics uses Spark’s broadcast variable feature to distribute
job information (BulkWriterContext) to executors. While this works today, it
triggers unnecessary work in Spark’s SizeEstimator, which inspects all
fields, including transient ones. BulkWriterContext (and the objects it
references) contains many transient fields that aren’t meant to be serialized,
yet SizeEstimator still walks them via reflection, wasting CPU cycles.
A cleaner approach would be to introduce a dedicated data class for the
broadcast variable, with only the minimal set of fields required for
distribution.
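
To illustrate the idea, here is a minimal sketch in plain Java with no Spark
dependency. All class and field names (HeavyContext, JobInfo, keyspace, table,
replicationFactor, driverOnlyCache) are hypothetical and are not taken from the
Analytics code base. It only shows the shape of the proposal: a small immutable
data class is derived from the heavyweight context, and only that object would
be handed to the broadcast.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class BroadcastSketch {

    // Hypothetical stand-in for a heavyweight context. The transient field is
    // skipped by Java serialization, but it is still reachable from the object,
    // so a reflective size estimate of this object would traverse it.
    static final class HeavyContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final String keyspace;
        final String table;
        final int replicationFactor;
        transient Map<String, byte[]> driverOnlyCache = new HashMap<>();

        HeavyContext(String keyspace, String table, int replicationFactor) {
            this.keyspace = keyspace;
            this.table = table;
            this.replicationFactor = replicationFactor;
        }

        // Copy out only the fields that executors are assumed to need.
        JobInfo toBroadcastable() {
            return new JobInfo(keyspace, table, replicationFactor);
        }
    }

    // Hypothetical dedicated data class: immutable, no transient fields, and
    // no references back to the heavyweight context.
    static final class JobInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String keyspace;
        final String table;
        final int replicationFactor;

        JobInfo(String keyspace, String table, int replicationFactor) {
            this.keyspace = keyspace;
            this.table = table;
            this.replicationFactor = replicationFactor;
        }
    }

    // Serialize and deserialize a value with standard Java serialization,
    // standing in for the trip from driver to executor.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        HeavyContext context = new HeavyContext("ks", "tbl", 3);
        context.driverOnlyCache.put("token-ranges", new byte[1024]);

        // Only the small data object makes the trip.
        JobInfo received = roundTrip(context.toBroadcastable());
        System.out.println(received.keyspace + "." + received.table
                + " rf=" + received.replicationFactor);
    }
}
```

In a real job, the object returned by toBroadcastable() would presumably be
what is passed to JavaSparkContext.broadcast, and executors would rebuild any
heavier state from it on their side. Which fields belong in the data class is
a design decision for the actual change.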