I have a 30 GB gzip file in HDFS (the original is a text file where each line
represents a text document) and Spark 1.2.0 on a YARN cluster with 3 worker
nodes, each with 64 GB RAM and 4 cores.
The replication factor for my file is 3.

I tried to implement a simple PySpark script to parse this file and represent
it as tf-idf:

Something like:
from pyspark.mllib.feature import HashingTF, IDF

lines = sc.textFile('file.gz')
docs = lines.map(lambda line: line.split(' '))

hashingTF = HashingTF()
tf = hashingTF.transform(docs)

tf.cache()

idf = IDF().fit(tf)
tfidf = idf.transform(tf)

tfidf.map(lambda v: ' '.join(u'{}:{}'.format(i, x)
                             for i, x in zip(v.indices, v.values))) \
     .saveAsTextFile('tfidf.txt')
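To sanity-check the output formatting in isolation, here is the same join
expression in plain Python, with a hypothetical mock standing in for MLlib's
SparseVector (only the `indices`/`values` attributes the lambda touches):

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.linalg.SparseVector:
# just the two attributes the formatting lambda uses.
MockSparse = namedtuple('MockSparse', ['indices', 'values'])

def format_tfidf(v):
    # space-separated index:weight pairs - same expression as in the script
    return ' '.join(u'{}:{}'.format(i, x) for i, x in zip(v.indices, v.values))

v = MockSparse(indices=[3, 17, 42], values=[0.5, 1.25, 2.0])
print(format_tfidf(v))  # -> 3:0.5 17:1.25 42:2.0
```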

I started the script with:
spark-submit --master yarn --num-executors 24 script.py
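For reference, a variant with explicit resources that I might try next (the
numbers are guesses on my part, not recommendations):

```shell
# hypothetical alternative: fewer, fatter executors
# (6 executors across 3 nodes, 4 cores and 16g each - figures are a guess)
spark-submit --master yarn \
  --num-executors 6 \
  --executor-cores 4 \
  --executor-memory 16g \
  script.py
```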

No comments on why I chose 24 executors - that was just a first try.


I saw in the output that all 24 executors (and the corresponding block
managers, with 0.5 GB each) were started across the 3 nodes, but the output
stops at these messages:
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on node3:36765
(size: 49.7 KB, free: 530.0 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node3:36765
(size: 21.6 KB, free: 529.9 MB)

I have been waiting for about an hour now and see no changes. (Unfortunately
I cannot monitor the cluster via the Web UI.)

My main questions:
Is this, generally speaking, a normal processing time for this volume of data
on such a cluster?
Is it normal that the output stops at "Added broadcast..."?
Is it OK to read a gzip archive via sc.textFile(...) in code, or is it better
to unpack it first (for performance reasons)?
How can I monitor a Spark job from the command line?
Please advise on tuning.
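The only command-line tools I know of so far are the generic YARN ones, e.g.:

```shell
# list running YARN applications (to find the application id)
yarn application -list

# fetch the aggregated logs of a given application
# (requires log aggregation to be enabled; the id here is made up)
yarn logs -applicationId application_1234567890123_0001
```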

Thanks!

Sergey.

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-of-text-file-in-large-gzip-archive-tp22073.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
