[ https://issues.apache.org/jira/browse/SPARK-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-13290.
-------------------------------
    Resolution: Not A Problem

Yes, just reading a file length locally is going to be much much faster than reading into a distributed framework and serializing from Python to a different JVM, putting it in memory as a block, pulling it back out, serializing back to the Python process, and counting. I don't know that this is relevant. Spark is distributed; you would use this in a situation where there is no such thing as a local file to read with a local Python process.

I don't yet see evidence of a bug, like something unreasonably slow relative to what it does. If you're not familiar with the code and can't get more specific, yeah I don't know if JIRA is the right place.

Please don't reopen a JIRA unless the reason it was closed is materially changed. I'm going to reclose this. If you later see a specific way to optimize this you can comment again.

> wholeTextFile and binaryFiles are really slow
> ---------------------------------------------
>
>                 Key: SPARK-13290
>                 URL: https://issues.apache.org/jira/browse/SPARK-13290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.6.0
>         Environment: Linux stand-alone
>            Reporter: mathieu longtin
>
> Reading biggish files (175MB) with wholeTextFiles or binaryFiles is extremely
> slow. It takes 3 minutes in Java versus 2.5 seconds in Python.
> The Java process balloons to 4.3GB of memory and uses 100% CPU the whole
> time. I suspect Spark reads it in small chunks and assembles it at the end,
> hence the large amount of CPU.
> {code}
> In [49]: rdd = sc.binaryFiles(pathToOneFile)
> In [50]: %time path, text = rdd.first()
> CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
> Wall time: 3min 32s
> In [51]: len(text)
> Out[51]: 191376122
> In [52]: %time text = open(pathToOneFile).read()
> CPU times: user 8 ms, sys: 691 ms, total: 699 ms
> Wall time: 2.43 s
> In [53]: len(text)
> Out[53]: 191376122
> {code}
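For anyone who wants to get more specific about where the time goes, here is a rough sketch (not a fix, and not how Spark does anything internally) that times the local read and two Spark paths separately. The helper names `timed`, `local_read_length`, `spark_first_length`, `spark_worker_length` and `compare` are made up for illustration; the only PySpark calls used are `sc.binaryFiles`, `map` and `first`, and `sc` is assumed to be an already running SparkContext. The second Spark variant counts the bytes in the Python worker, so only an integer comes back to the driver, which may help tell driver-side transfer cost apart from the read itself.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def local_read_length(path):
    """Baseline: read the whole file in the local Python process."""
    with open(path, "rb") as f:
        return len(f.read())


def spark_first_length(sc, path):
    """As in the report: pull the file content back to the driver, then count."""
    _, content = sc.binaryFiles(path).first()
    return len(content)


def spark_worker_length(sc, path):
    """Count inside the Python worker; only an integer returns to the driver."""
    return sc.binaryFiles(path).map(lambda kv: len(kv[1])).first()


def compare(path, sc=None):
    """Time the local read, plus both Spark variants if a SparkContext is given."""
    results = {"local": timed(lambda: local_read_length(path))}
    if sc is not None:  # assumption: sc is a live pyspark.SparkContext
        results["spark_first"] = timed(lambda: spark_first_length(sc, path))
        results["spark_worker"] = timed(lambda: spark_worker_length(sc, path))
    return results
```

Calling `compare(pathToOneFile, sc)` returns a dict mapping each variant to a `(length, seconds)` pair. If the two Spark variants take about the same time, the cost is probably not in shipping the bytes back to the driver; that would be something concrete to post in a follow-up comment.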