Hi, I am trying to vectorize a corpus of texts (about 500K texts in 13 files, roughly 100GB in total) stored in HDFS, on a YARN cluster.
This process has already taken about 20 hours on a 3-node cluster with 6 cores and 20GB RAM on each node. In my opinion that's too long :-)

I started the task with the following command:

spark-submit --master yarn --num-executors 9 --executor-memory 5g --executor-cores 2 --driver-memory 5g weight.py

weight.py:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

conf = SparkConf() \
    .set("spark.hadoop.validateOutputSpecs", "false") \
    .set("spark.yarn.executor.memoryOverhead", "900")
sc = SparkContext(conf=conf)

# read files from directory 'in/texts.txt' in HDFS and tokenize each line
texts = sc.textFile('in/texts.txt') \
    .map(lambda line: line.split())

# term frequencies via feature hashing
hashingTF = HashingTF()
tf = hashingTF.transform(texts)
tf.cache()

# inverse document frequencies (only terms appearing in at least 100 documents)
idf = IDF(minDocFreq=100).fit(tf)
tfidf = idf.transform(tf)

# normalize each TF-IDF vector
n = Normalizer()
normalized = n.transform(tfidf)

def x2(pair):
    # unpack the (vector, document number) pair produced by zipWithIndex
    vec, num = pair
    triples = []
    for term_id, weight in zip(vec.indices, vec.values):
        triples.append((num, term_id, weight))
    return triples

# I use zipWithIndex to enumerate documents
normalized.zipWithIndex() \
    .flatMap(x2) \
    .map(lambda t: '{}\t{}\t{}'.format(t[0], t[1], t[2])) \
    .saveAsTextFile('out/weights.txt')

1) What could be the bottleneck? Unfortunately I don't have access to the web UI. In the log file I see stages 0, 1, 2 and 3:

Stage 0 "MapPartitionsRDD[6] at mapPartitionsWithIndex at RDDFunctions.scala:108" (584 tasks) completed very quickly.
Stage 1 "MappedRDD[8] at values at RDDFunctions.scala:110" (23 tasks) was quick too.
Stage 2 "zipWithIndex" (584 tasks) was long (17 hours).
Stage 3 "saveAsTextFile" (584 tasks) is long as well (still executing after about 2 hours).

I don't understand the boundaries of stages 0 and 1, and I don't understand why I see numbers like 584 or 23 tasks on these stages.

2) On a previous run of this task I saw a lot of "executor lost" errors from the YARN scheduler. I then added the .set("spark.yarn.executor.memoryOverhead", "900") setting in the code, and now I see only a few such messages. Could this be a reason for the poor performance?

Please advise! Any explanations appreciated!

Serg.
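P.S. For what it's worth, here is a small diagnostic sketch I was thinking of adding to weight.py (it assumes only the RDD names already defined above); it just prints partition counts, so I can compare them with the 584 and 23 task counts reported for the stages:

# Hypothetical diagnostic lines, to be placed after 'normalized' is built in weight.py.
# Each stage runs one task per partition, so these counts should explain the task numbers.
print(texts.getNumPartitions())       # partitions of the HDFS input (one per input split)
print(tf.getNumPartitions())          # term-frequency vectors
print(tfidf.getNumPartitions())       # TF-IDF vectors
print(normalized.getNumPartitions())  # normalized vectors that feed zipWithIndex

If it matters, I believe the same overhead setting could also be passed on the command line instead of in the code, e.g. spark-submit --conf spark.yarn.executor.memoryOverhead=900 ... (rest of the command unchanged).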