[GitHub] [datasketches-java] priyamtejaswin opened a new issue, #445: Problems with using Datasketches in Spark applications.

via GitHub Tue, 23 May 2023 13:28:41 -0700


priyamtejaswin opened a new issue, #445:
URL: https://github.com/apache/datasketches-java/issues/445


   Hi,
   
   I'm using ThetaSketches in my Spark application. I started by following the 
outline described in the [Example of using ThetaSketch in 
Spark](https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html) 
documentation.
   
   Sketches are made serializable through Java's `ObjectInputStream` and 
`ObjectOutputStream`. Since Spark uses the same mechanism for its own 
serialization/deserialization (during shuffling, etc.), I am hitting the size 
limit of the underlying stream. The [limit is 
2GB](https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/classes/java/io/ByteArrayOutputStream.java#L121)
 and is set by the JDK.
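
   To make the mechanism concrete, here is a minimal JDK-only sketch of the wrapper pattern I mean. It is not the code from the linked example: `SketchBytesWrapper` is a hypothetical name, and the `payload` byte array only stands in for a sketch's serialized bytes. It shows the custom `writeObject`/`readObject` hooks and the round trip through a `ByteArrayOutputStream`, which is where the array-size limit applies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical wrapper: the payload stands in for a sketch's serialized bytes.
class SketchBytesWrapper implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: the field is written by the custom hooks below, not by default serialization.
    private transient byte[] payload;

    SketchBytesWrapper(byte[] payload) {
        this.payload = payload.clone();
    }

    byte[] getPayload() {
        return payload.clone();
    }

    // Invoked by ObjectOutputStream: write a length prefix, then the raw bytes.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Invoked by ObjectInputStream: read the length prefix, then exactly that many bytes.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        int length = in.readInt();
        payload = new byte[length];
        in.readFully(payload);
    }

    // Serialize into an in-memory buffer; the buffer is a single byte[] internally,
    // so it cannot grow past the JVM's maximum array size.
    static byte[] serialize(SketchBytesWrapper wrapper) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(wrapper);
        }
        return buffer.toByteArray();
    }

    static SketchBytesWrapper deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SketchBytesWrapper) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] fakeSketchBytes = {1, 2, 3, 4, 5};
        SketchBytesWrapper copy = deserialize(serialize(new SketchBytesWrapper(fakeSketchBytes)));
        System.out.println(Arrays.equals(fakeSketchBytes, copy.getPayload()));
    }
}
```

   In a real application the payload would come from the sketch itself rather than a literal array; the five-byte array here is purely illustrative.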
   
   I was wondering what other options exist for massively parallelizing 
Sketches inside Spark apps.
   
   Any thoughts or ideas are welcome. Thanks!


