1. sc.textFile can read a .gz file directly - Hadoop picks the
decompression codec from the file extension - but gzip is not a
splittable format, so the entire 30GB archive is decompressed by a
single task. That is most likely why nothing appears to be happening.
Either unpack the file before putting it in HDFS, or repartition the
RDD right after reading so the downstream stages run in parallel.
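For what it's worth, here is a small pure-Python sketch (no Spark needed)
of why the gzip case behaves this way: a gzip stream can only be
decompressed front-to-back, so there is no way to seek into the middle of
the archive and hand different byte ranges to different tasks. The file
contents here are made up for illustration.

```python
import gzip
import io

# Build a tiny in-memory gzip "archive" of line-oriented text.
raw = b"doc one\ndoc two\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)

# Reading it back works only as a single sequential stream -- gzip has
# no block index, so you cannot start decompressing from an arbitrary
# offset. This is why Spark assigns a whole .gz file to one task.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    lines = f.read().decode().splitlines()
```

This is also why formats like bzip2 (which is splittable in Hadoop) or
plain uncompressed text parallelize so much better for files this large.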
2. Instead of map, do a mapPartitions.

3. You need to open the driver UI and see what's really taking time. If
that is running on a remote machine and you are not able to access it
locally, then create an ssh tunnel (ssh -L 4040:127.0.0.1:4040
user@remotemachine).

Thanks
Best Regards

On Mon, Mar 16, 2015 at 1:39 PM, sergunok <ser...@gmail.com> wrote:
> I have a 30GB gzip file (originally a text file where each line
> represents a text document) in HDFS, and Spark 1.2.0 on a YARN cluster
> with 3 worker nodes, each with 64GB RAM and 4 cores.
> The replication factor for my file is 3.
>
> I tried to implement a simple pyspark script to parse this file and
> represent it as tf-idf:
>
> Something like:
> lines = sc.textFile('file.gz')
> docs = lines.map(lambda line: line.split(' '))
>
> hashingTF = HashingTF()
> tf = hashingTF.transform(docs)
>
> tf.cache()
>
> idf = IDF().fit(tf)
> tfidf = idf.transform(tf)
>
> tfidf.map(lambda t: ' '.join([u'{}:{}'.format(t[0], t[1]) for t in
> zip(t.indices, t.values)])) \
>     .saveAsTextFile('tfidf.txt')
>
> I started the script with:
> spark-submit --master yarn --num-executors 24 script.py
>
> No comment on why I selected 24 executors - that was just a first try.
>
> I saw in the output that all 24 executors and the corresponding block
> managers, with 0.5 GB each, were started on the 3 nodes, but the output
> stops at these messages:
> INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on node3:36765
> (size: 49.7 KB, free: 530.0 MB)
> INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node3:36765
> (size: 21.6 KB, free: 529.9 MB)
>
> I have been waiting for about an hour already and don't see any changes.
> (Unfortunately I cannot monitor the cluster via the Web UI.)
>
> My main question: is this, generally speaking, a normal processing time
> for such a volume of data on such a cluster?
> Is it OK that the output stops at "Added broadcast..."?
> Is it OK to read a gzip archive via sc.textFile(..)
> in code, or is it better to unpack it beforehand (for performance
> reasons)?
> How can I monitor a Spark task via the command line?
> Please advise on some tuning.
>
> Thanks!
>
> Sergey.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Processing-of-text-file-in-large-gzip-archive-tp22073.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
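Regarding point 2 above: the reason mapPartitions helps is that your
function receives an iterator over a whole partition, so any per-record
setup cost (compiling a regex, building a tokenizer, opening a
connection) is paid once per partition rather than once per line. A
pure-Python sketch of the pattern - function names and the sample
"partition" are illustrative, no Spark required:

```python
import re

def tokenize_partition(lines_iter):
    # Per-partition setup: compile the pattern once for the whole
    # partition, not once per line -- this is the win over a plain map.
    splitter = re.compile(r"\s+")
    for line in lines_iter:
        yield splitter.split(line.strip())

# In Spark this would be: docs = lines.mapPartitions(tokenize_partition)
# Plain-Python demonstration over one "partition" of two lines:
partition = ["spark makes rdds", "gzip is not splittable"]
docs = list(tokenize_partition(iter(partition)))
```

Note the function takes and returns an iterator, so it streams through
the partition without materializing it in memory - important when a
single partition is large, as it will be here.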