I'm running the following code to load an entire directory of Avros using hadoopRDD.
val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*"

// Set up the path for the job via a Hadoop JobConf
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setJobName("Test Scala Job")
FileInputFormat.setInputPaths(jobConf, input)

val rdd = sc.hadoopRDD(
  jobConf,
  classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  1)

It successfully loads a single file, but when I load the entire directory, I get this:

scala> rdd.first
14/05/31 17:03:01 INFO mapred.FileInputFormat: Total input paths to process : 17
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 0 (first at <console>:43) with 1 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/31 17:03:02 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
14/05/31 17:03:02 INFO spark.SparkContext: Job finished: first at <console>:43, took 0.43242113 s
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:43) with 16 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40), which has no missing parents
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40)
14/05/31 17:03:02 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 16 tasks
14/05/31 17:03:17 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/05/31 17:03:32 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...<many times>...

The job never finishes. What should I do?

--
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com