mathieu longtin created SPARK-13290:
---------------------------------------

             Summary: wholeTextFiles and binaryFiles are really slow
                 Key: SPARK-13290
                 URL: https://issues.apache.org/jira/browse/SPARK-13290
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.6.0
         Environment: Linux stand-alone
            Reporter: mathieu longtin


Reading biggish files (the example below is ~190 MB) with wholeTextFiles or 
binaryFiles is extremely slow: roughly 3.5 minutes through the JVM-backed read 
versus 2.5 seconds reading the same file directly in Python.

The Java process balloons to 4.3 GB of memory and uses 100% CPU the whole time. 
I suspect Spark reads the file in small chunks and reassembles them at the end, 
hence the heavy CPU usage.

{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)
In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s
In [51]: len(text)
Out[51]: 191376122
In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s
In [53]: len(text)
Out[53]: 191376122
{code}
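
As a point of comparison rather than a fix, here is a rough workaround sketch: 
distribute the file paths themselves and read each file directly in the Python 
workers, bypassing the JVM read path. This assumes the files are visible from 
every worker (local or shared filesystem, as in this stand-alone setup) and that 
{{paths}} is a hypothetical list of the files to read.

{code}
def read_whole_file(path):
    # Read the whole file as bytes, mimicking the (path, content)
    # pairs returned by sc.binaryFiles.
    with open(path, 'rb') as f:
        return path, f.read()

# One partition per file so each file is read in a single task.
rdd = sc.parallelize(paths, numSlices=len(paths)).map(read_whole_file)
path, text = rdd.first()
{code}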




