wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
Hi, I have a cluster of 4 machines and I run

sc.wholeTextFiles("/x/*/*.txt")

Folder x contains subfolders, and each subfolder contains thousands of files,
with a total of ~1 million files matching the path expression.

My Spark job starts processing the files, but single-threaded. I can see
in the Spark UI that only 1 executor is used out of 4, and only 1 thread out
of the configured 24:

spark-submit --class com.stratified.articleids.NxmlExtractorJob \
--driver-memory 8g \
--executor-memory 8g \
--num-executors 4 \
--executor-cores 16 \
--master yarn-cluster \
--conf spark.akka.frameSize=128 \
$JAR


My actual code is:

  val rdd=extractIds(sc.wholeTextFiles(xmlDir))
  rdd.saveAsObjectFile(serDir)

Is saveAsObjectFile causing this, and are there any workarounds?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
In the Spark UI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32), but the threading behaviour is the same.
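
One variation that may be worth trying is the optional minPartitions
argument of wholeTextFiles, which hints at the input split count when the
files are read rather than reshuffling afterwards as repartition does. The
sketch below is only illustrative: the object name, the
suggestedMinPartitions helper and the factor of 2 tasks per core are
assumptions, and minPartitions is a hint, not a guarantee. It also may not
help if the time is actually spent listing the ~1 million paths on the
driver before any tasks are launched.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesPartitions {
  // Hypothetical helper: aim for a few tasks per available core.
  // The default of 2 tasks per core is an assumption, not a Spark setting.
  def suggestedMinPartitions(numExecutors: Int, executorCores: Int, tasksPerCore: Int = 2): Int =
    numExecutors * executorCores * tasksPerCore

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WholeTextFilesPartitions"))

    // Values mirror the spark-submit flags above: 4 executors with 16 cores each.
    val minPartitions = suggestedMinPartitions(numExecutors = 4, executorCores = 16)

    // Pass the hint at read time instead of calling repartition afterwards.
    val files = sc.wholeTextFiles("/x/*/*.txt", minPartitions)

    // Check how many input partitions Spark actually created.
    println(s"input partitions: ${files.partitions.length}")

    sc.stop()
  }
}
```

If the printed partition count is still very small, the number of tasks in
the first stage will be small too, whatever repartition is applied later.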


