1. I don't think textFile is capable of unpacking a .gz file. You need to use
hadoopFile or newAPIHadoopFile for this.
Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is
compute splits on gzipped files, so if you have a single file, you'll end up
with a single partition.
You probably want to update this line as follows:
lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)
For more details on why, see this answer
http://stackoverflow.com/a/27631722/877069.
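For background, the reason Spark can't split a .gz file is that gzip is a
sequential stream format: there is no way to seek to an arbitrary byte offset
and start decompressing there, so the whole file has to go to a single task.
A minimal sketch with Python's standard gzip module (the sample data here is
made up purely for illustration):

```python
import gzip
import io

# Build a small gzipped "file" in memory (stand-in for file.gz).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"line one\nline two\nline three\n")

# Decompression must start from the beginning of the stream; there is no
# way to jump straight to "line two" without first reading "line one".
# This is exactly why Spark cannot carve the file into multiple splits.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    lines = f.read().decode().splitlines()

print(lines)
```

Hence the repartition() call above: the single-partition RDD produced by
reading the .gz file is reshuffled so that downstream work is parallelized.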
Nick
On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:
1. I
(..) in code, or is it better to unpack it beforehand (for performance
reasons)?
How can I monitor a Spark task via the command line?
Please advise on any tuning.
Thanks!
Sergey.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-of-text-file-in-large-gzip-archive-tp22073.html
Sent from the Apache Spark User List mailing list