[ https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971727#comment-16971727 ]
Hyukjin Kwon commented on SPARK-29830:
--------------------------------------

(Please avoid setting a target version; that field is usually reserved for committers.)

> PySpark.context.SparkContext.binaryFiles improved memory with buffer
> --------------------------------------------------------------------
>
>                 Key: SPARK-29830
>                 URL: https://issues.apache.org/jira/browse/SPARK-29830
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: Jörn Franke
>            Priority: Major
>
> At the moment, PySpark reads binary files into a byte array directly. This
> means it reads the full binary file into memory immediately, which is 1)
> memory-inefficient and 2) differs from the Scala implementation (see PySpark
> here:
> [https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles]).
>
> In Scala, Spark returns a PortableDataStream, which means the application
> does not need to read the full content of the stream into memory to work on
> it (see
> [https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext]).
>
> Hence, it is proposed to adapt the PySpark implementation to return
> something similar to a PortableDataStream in Scala (e.g.
> [BytesIO|https://docs.python.org/3/library/io.html#io.BytesIO]).
>
> Reading binary files in an efficient manner is crucial for many IoT
> applications, but potentially also other fields (e.g. disk image analysis in
> forensics).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
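A minimal sketch of what the ticket proposes, using the `io.BytesIO` wrapper the reporter mentions. This is an illustration only, not Spark code: the variable names are hypothetical, no Spark installation is needed, and a bytes literal stands in for one binary file's contents. Note that `BytesIO` still holds the full buffer in memory; the point of the sketch is the file-like, incremental read interface that mirrors Scala's `PortableDataStream.open()`.

```python
import io

# Stand-in for the contents of one binary file (1024 bytes).
payload = bytes(range(256)) * 4

# Today: sc.binaryFiles() effectively yields (path, payload) --
# the whole file is materialized as a single bytes object.

# Proposed: yield (path, stream), where the stream supports
# chunked reads instead of forcing the caller to hold it all.
stream = io.BytesIO(payload)

header = stream.read(4)        # inspect only the first 4 bytes
print(header)                  # b'\x00\x01\x02\x03'

rest = stream.read()           # remainder can be consumed when needed
print(len(rest))               # 1020
```

A caller doing, say, magic-number detection on large IoT dumps could then stop after `stream.read(4)` instead of paying for the whole file up front.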