[ 
https://issues.apache.org/jira/browse/CASSANALYTICS-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANALYTICS-89:
-----------------------------------
    Change Category: Code Clarity
         Complexity: Normal
             Status: Open  (was: Triage Needed)

PR: https://github.com/apache/cassandra-analytics/pull/146
CI: 
https://app.circleci.com/pipelines/github/yifan-c/cassandra-analytics?branch=CASSANALYTICS-89%2Fbroadcast-var-refactor

> Create dedicated data class for broadcast variable during bulk write
> --------------------------------------------------------------------
>
>                 Key: CASSANALYTICS-89
>                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-89
>             Project: Apache Cassandra Analytics
>          Issue Type: Improvement
>          Components: Writer
>            Reporter: Yifan Cai
>            Assignee: Yifan Cai
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Bulk write in Analytics uses Spark’s broadcast variable feature to distribute 
> job information (BulkWriterContext) to executors. This works today, but it 
> triggers unnecessary work in Spark’s SizeEstimator, which inspects all 
> fields, including transient ones. BulkWriterContext and the objects it 
> references contain many transient fields that are not meant to be 
> serialized, yet SizeEstimator still walks them via reflection, wasting CPU 
> cycles.
>
> A cleaner approach would be to introduce a dedicated data class for the 
> broadcast variable, with only the minimal set of fields required for 
> distribution.
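
A minimal sketch of the idea in the description, not the code from the PR. It uses no Spark dependency, and every name in it (HeavyContext, BroadcastConfig, toBroadcastConfig, the field names) is a hypothetical stand-in. It only illustrates that a reflective walker sees transient fields too, while a dedicated data class exposes just the fields that need to be distributed.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class BroadcastSketch {

    // Hypothetical stand-in for a context object such as BulkWriterContext:
    // a few serializable fields plus several transient, driver-side helpers.
    static final class HeavyContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final String keyspace;
        final String table;
        transient Object clusterClient = new Object(); // not serialized
        transient Object schemaCache = new Object();   // not serialized
        transient Object jobStats = new Object();      // not serialized

        HeavyContext(String keyspace, String table) {
            this.keyspace = keyspace;
            this.table = table;
        }

        // Extract only what executors need into the dedicated data class.
        BroadcastConfig toBroadcastConfig() {
            return new BroadcastConfig(keyspace, table);
        }
    }

    // Hypothetical dedicated data class: the minimal set of fields, no transients.
    static final class BroadcastConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final String keyspace;
        final String table;

        BroadcastConfig(String keyspace, String table) {
            this.keyspace = keyspace;
            this.table = table;
        }
    }

    // Counts declared instance fields, transient ones included. This mimics
    // a walker that skips static fields but does not skip transient fields.
    static int instanceFieldCount(Class<?> type) {
        int count = 0;
        for (Field f : type.getDeclaredFields()) {
            if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
                count++;
            }
        }
        return count;
    }

    // Counts declared instance fields marked transient.
    static int transientFieldCount(Class<?> type) {
        int count = 0;
        for (Field f : type.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (!Modifier.isStatic(mods) && Modifier.isTransient(mods)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        HeavyContext ctx = new HeavyContext("ks", "tbl");
        BroadcastConfig cfg = ctx.toBroadcastConfig();
        System.out.println("HeavyContext fields walked: "
                + instanceFieldCount(HeavyContext.class)
                + ", transient: " + transientFieldCount(HeavyContext.class));
        System.out.println("BroadcastConfig fields walked: "
                + instanceFieldCount(BroadcastConfig.class)
                + " for " + cfg.keyspace + "." + cfg.table);
    }
}
```

In a real job, the small object rather than the full context would be the value handed to Spark's broadcast call (for example JavaSparkContext.broadcast), and executors would rebuild any transient helpers locally from it.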



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

