Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this. Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is compute splits on .gz files, so if you have a single file, you'll have a single partition.
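The reason a .gz file can't be split is that a gzip stream must be decompressed sequentially from the beginning; there is no way to seek into the middle and resume. A plain-Python sketch of this behavior using the standard gzip module (the file name and contents are illustrative, not from the thread):

```python
import gzip
import os
import tempfile

# Create a small gzip "archive" of text lines (illustrative data).
path = os.path.join(tempfile.mkdtemp(), "file.gz")
with gzip.open(path, "wt") as f:
    f.write("line one\nline two\nline three\n")

# Reading works transparently, just as sc.textFile does -- but only
# as one sequential stream from byte 0, which is why Spark places the
# whole file in a single partition.
with gzip.open(path, "rt") as f:
    lines = f.read().splitlines()

print(len(lines))  # 3
```

Plain text (or a splittable codec such as bzip2) avoids this limitation, since Hadoop-style input formats can then carve the file into multiple splits.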

Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer: http://stackoverflow.com/a/27631722/877069. Nick

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote: 1. I
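The effect of that repartition call is to spread the lines of the single gzip-backed partition across many partitions so that later stages can run in parallel. A minimal stand-in for the idea in plain Python (the names repartition, default_parallelism, and the factor of 3 are illustrative; this is not the Spark API itself, which shuffles data across executors):

```python
# Distribute the lines of a single-partition dataset round-robin
# across num_partitions buckets, roughly what RDD.repartition
# achieves (Spark actually performs a cluster-wide shuffle).
def repartition(lines, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, line in enumerate(lines):
        parts[i % num_partitions].append(line)
    return parts

default_parallelism = 4  # e.g. total cores in the cluster
data = [f"line {i}" for i in range(24)]

parts = repartition(data, default_parallelism * 3)
print(len(parts))                  # 12 partitions
print(max(len(p) for p in parts))  # 2 lines each -> balanced work
```

The multiplier over defaultParallelism trades shuffle cost for smaller, more evenly scheduled tasks after the unsplittable read.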

Processing of text file in large gzip archive

2015-03-16 Thread sergunok
(..) in code, or is it better to unpack it beforehand (for performance reasons)? How can I monitor a Spark task via the command line? Please advise on tuning. Thanks! Sergey. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Processing-of-text-file

Re: Processing of text file in large gzip archive

2015-03-16 Thread Akhil Das