Re: [PR] feat: Enable Comet broadcast by default [arrow-datafusion-comet]

via GitHub Wed, 03 Apr 2024 09:47:06 -0700


viirya commented on code in PR #213:
URL: https://github.com/apache/arrow-datafusion-comet/pull/213#discussion_r1550100500



##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -161,4 +173,84 @@ object Utils {
       toArrowField(field.name, field.dataType, field.nullable, timeZoneId)
     }.asJava)
   }
+
+  /**
+   * Serializes a list of `ColumnarBatch` into an output stream. This method must be in `spark`
+   * package because `ChunkedByteBufferOutputStream` is spark private class. As it uses Arrow
+   * classes, it must be in `common` module.
+   *
+   * @param batches
+   *   the output batches, each batch is a list of Arrow vectors wrapped in `CometVector`
+   * @param out
+   *   the output stream
+   */
+  def serializeBatches(batches: Iterator[ColumnarBatch]): Iterator[(Long, ChunkedByteBuffer)] = {
+    batches.map { batch =>
+      val dictionaryProvider: CDataDictionaryProvider = new CDataDictionaryProvider
+
+      val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
+      val cbbos = new ChunkedByteBufferOutputStream(1024 * 1024, ByteBuffer.allocate)

Review Comment:
   I need to move `serializeBatches` into the `spark` package because
`ChunkedByteBufferOutputStream` is a Spark-private class. I cannot move
`serializeBatches` to the `spark` module because it uses Arrow packages (we
shade Arrow in the `common` module).
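
   To illustrate the visibility constraint, here is a minimal sketch. It
assumes the class is restricted with Scala's `private[spark]` qualifier;
`PrivateStream` and `VisibilityDemo` are hypothetical stand-ins, not names
from Spark or Comet.

```scala
// Hypothetical stand-in for a Spark-private class. The `private[spark]`
// qualifier limits access to code inside the org.apache.spark package tree.
package org.apache.spark.util.io {
  private[spark] class PrivateStream(chunkSize: Int) {
    def describe: String = s"chunkSize=$chunkSize"
  }
}

// Compiles: this object is declared under org.apache.spark, so it may
// reference the restricted class no matter which build module holds the file.
package org.apache.spark.sql.comet.util {
  import org.apache.spark.util.io.PrivateStream

  object VisibilityDemo {
    def make(): String = new PrivateStream(1024 * 1024).describe
  }
}

// Would not compile: code outside org.apache.spark cannot access the class.
package org.apache.comet {
  // object Outside { val s = new org.apache.spark.util.io.PrivateStream(1) }
}
```

   The package of the declaration is what matters for access, which is why a
file in the `common` module can still use such a class as long as it is
declared under `org.apache.spark`.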

