I'm running the following code to load an entire directory of Avro files
using hadoopRDD.

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*"

// Set up the path for the job via a Hadoop JobConf
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setJobName("Test Scala Job")
FileInputFormat.setInputPaths(jobConf, input)

// Read (AvroWrapper[GenericRecord], NullWritable) pairs via the old mapred API
val rdd = sc.hadoopRDD(
  jobConf,
  classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]], // key class
  classOf[org.apache.hadoop.io.NullWritable],                 // value class
  1)                                                          // minPartitions
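
For context, downstream I unwrap the records roughly like this (a sketch;
the "url" field name is just an illustration, not our actual schema):

// Sketch: pull the GenericRecord out of each AvroWrapper.
val records = rdd.map { case (wrapper, _) => wrapper.datum() }
records.map(_.get("url")).take(5).foreach(println) // "url" is a placeholder field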


It successfully loads a single file, but when I load an entire directory, I
get this:

scala> rdd.first

14/05/31 17:03:01 INFO mapred.FileInputFormat: Total input paths to process : 17
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 0 (first at <console>:43) with 1 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/31 17:03:02 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
14/05/31 17:03:02 INFO spark.SparkContext: Job finished: first at <console>:43, took 0.43242113 s
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:43) with 16 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40), which has no missing parents
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40)
14/05/31 17:03:02 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 16 tasks
14/05/31 17:03:17 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/05/31 17:03:32 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

...<many times>...


And it never finishes. What should I do?
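
In case it's relevant, this is roughly how the shell's SparkContext is
configured (a sketch; the master URL and memory value below are
placeholders, not our real settings):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: placeholder master URL and executor memory.
// The "has not accepted any resources" warning suggests no worker can
// satisfy whatever these are actually set to on our cluster.
val conf = new SparkConf()
  .setMaster("spark://hivecluster2:7077") // placeholder
  .setAppName("Test Scala Job")
  .set("spark.executor.memory", "1g")     // placeholder
val sc = new SparkContext(conf)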
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
