mathieu longtin created SPARK-13290:
---------------------------------------
             Summary: wholeTextFile and binaryFiles are really slow
                 Key: SPARK-13290
                 URL: https://issues.apache.org/jira/browse/SPARK-13290
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.6.0
         Environment: Linux stand-alone
            Reporter: mathieu longtin

Reading biggish files (175 MB) with wholeTextFile or binaryFiles is extremely slow: about 3.5 minutes through Spark versus 2.5 seconds with a plain Python read of the same file. The Java process balloons to 4.3 GB of memory and uses 100% CPU the whole time. I suspect Spark reads the file in small chunks and reassembles them at the end, hence the heavy CPU use.

{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)

In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s

In [51]: len(text)
Out[51]: 191376122

In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s

In [53]: len(text)
Out[53]: 191376122
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
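A possible workaround sketch, not part of the original report: since a plain Python read of the same file takes seconds, one option is to distribute the list of file paths and read each file with ordinary Python I/O inside the task, bypassing the JVM chunk-and-reassemble path suspected above. The names `read_whole`, `sc`, and `list_of_paths` are illustrative assumptions, and this only works when each file fits in a single executor's memory and is visible on every worker node.

```python
import os
import tempfile

def read_whole(path):
    """Read an entire file with plain Python I/O; returns (path, bytes),
    mirroring the (path, content) pairs produced by sc.binaryFiles."""
    with open(path, "rb") as f:
        return path, f.read()

# Demonstrate on a small temporary file (a stand-in for the 175 MB input).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"spark" * 1000)
    sample = tmp.name

p, data = read_whole(sample)
print(p == sample, len(data))
os.remove(sample)

# With a live SparkContext `sc` (assumed) and a Python list of paths,
# the same reader can be distributed instead of using sc.binaryFiles:
# rdd = sc.parallelize(list_of_paths).map(read_whole)
```

This keeps the per-file read on the fast plain-I/O path measured in the report; Spark only schedules the tasks.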