Hi, I have a cluster of 4 machines, and I run:

sc.wholeTextFiles("/x/*/*.txt")

Folder x contains subfolders, and each subfolder contains thousands of files,
with a total of ~1 million files matching the path expression.

My Spark job starts processing the files, but single-threaded. I can see in
the Spark UI that only 1 executor out of 4 is used, and only 1 thread out of
the 24 configured:

spark-submit --class com.stratified.articleids.NxmlExtractorJob \
        --driver-memory 8g \
        --executor-memory 8g \
        --num-executors 4 \
        --executor-cores 16 \
        --master yarn-cluster \
        --conf spark.akka.frameSize=128 \
        $JAR


My actual code is:

      val rdd=extractIds(sc.wholeTextFiles(xmlDir))
      rdd.saveAsObjectFile(serDir)
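A quick way to confirm whether wholeTextFiles really produced a single partition (which would explain the single thread) is to print the partition count before saving. Below is a minimal, self-contained local sketch; the temp directory with a few tiny files is a hypothetical stand-in for my real xmlDir:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  // Returns how many partitions wholeTextFiles produces for the given glob.
  def partitionCount(sc: SparkContext, glob: String): Int =
    sc.wholeTextFiles(glob).partitions.length

  def main(args: Array[String]): Unit = {
    // Local context only for illustration; on YARN this comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-check").setMaster("local[2]"))

    // Hypothetical stand-in for the real input directory.
    val tmp = java.nio.file.Files.createTempDirectory("xmlDir").toString
    (1 to 4).foreach { i =>
      java.nio.file.Files.write(
        java.nio.file.Paths.get(tmp, s"$i.txt"), s"doc $i".getBytes)
    }

    println(s"partitions: ${partitionCount(sc, tmp + "/*.txt")}")
    sc.stop()
  }
}
```

If this prints 1 (or some other very small number) on the real data, the input splitting, not the save, is the bottleneck.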

Is saveAsObjectFile causing this, and are there any workarounds?
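My understanding is that saveAsObjectFile just writes out whatever partitioning the RDD already has, so the more likely culprit is that wholeTextFiles (which combines many small files into each input split) packed everything into very few partitions. Two workarounds I am considering, sketched below against a hypothetical local setup with temp files standing in for the real /x/*/*.txt input: passing the minPartitions argument to wholeTextFiles, or repartitioning before the expensive work:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  // Returns (partitions with a minPartitions hint, partitions after repartition).
  def partitionCounts(sc: SparkContext, glob: String): (Int, Int) = {
    // Option 1: ask wholeTextFiles for more input splits up front.
    // minPartitions is a hint; the actual count depends on total input size.
    val hinted = sc.wholeTextFiles(glob, minPartitions = 4)

    // Option 2: force a shuffle so downstream stages use all cores,
    // regardless of how few input splits were created.
    val spread = sc.wholeTextFiles(glob).repartition(4)

    (hinted.partitions.length, spread.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Local context only for illustration; on YARN this comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-sketch").setMaster("local[4]"))

    // Hypothetical stand-in for the real input files.
    val tmp = java.nio.file.Files.createTempDirectory("x").toString
    (1 to 16).foreach { i =>
      java.nio.file.Files.write(
        java.nio.file.Paths.get(tmp, s"$i.txt"), s"doc $i".getBytes)
    }

    val (hinted, spread) = partitionCounts(sc, tmp + "/*.txt")
    println(s"hinted: $hinted, spread: $spread")
    sc.stop()
  }
}
```

On the real cluster I would presumably use a much larger target (e.g. num-executors × executor-cores) instead of 4; repartition guarantees the requested count at the cost of a shuffle, while minPartitions avoids the shuffle but is only a hint.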





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
