On Fri, Feb 5, 2016 at 12:58 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
> Hi,
>
> We're facing a situation where simple queries to parquet files stored in
> Swift through a Hive Metastore sometimes fail with this exception:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
> in stage 58.0 failed 4 times, most recent failure: Lost task 6.3 in stage
> 58.0 (TID 412, agent-1.mesos.private):
> org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing
> mandatory configuration option: fs.swift.service.######.auth.url
> at
> org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:219)
> (...)
>
> Queries requiring a full table scan, like select(count(*)), would fail with
> the mentioned exception, while smaller chunks of work like "select *
> from ... LIMIT 5" would succeed.
> ...

An update: when using the Zeppelin Notebook on a Mesos cluster, as a
_workaround_ I can get the Notebook running reliably by using this setting
and starting with this paragraph:

    spark.mesos.coarse = true

    import util.Random.nextInt
    sc.parallelize((0 to 1000).toList, 20).toDF.write.parquet(s"swift://###/test/${util.Random.nextInt}")

This parquet write touches all the executors (4 worker nodes in this
experiment). So it looks like _writing_ once, at the start of the Notebook,
distributes the Swift authentication data to the executors, and after that
all queries just work (including the count(*) queries that failed before).

This is using a Zeppelin notebook with Spark 1.5.1 and Hadoop 2.4.

HTH,
Peter
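P.S. A sketch of an alternative we haven't fully verified: instead of relying on the warm-up write to ship the Swift credentials around, the hadoop-openstack options could be set explicitly on the SparkContext's Hadoop configuration, which (as far as I understand) is serialized to executors with each job. The service name "myprovider" and all credential values below are placeholders, not our real setup:

```scala
// Sketch: set the hadoop-openstack Swift options programmatically in a
// Zeppelin paragraph, instead of (or in addition to) core-site.xml.
// "myprovider" is a placeholder service name; it must match the host
// suffix in the swift://container.myprovider/ URLs you query.
val svc = "myprovider"
sc.hadoopConfiguration.set(s"fs.swift.service.$svc.auth.url",
  "https://auth.example.com/v2.0/tokens")  // placeholder Keystone endpoint
sc.hadoopConfiguration.set(s"fs.swift.service.$svc.tenant",   "demo")    // placeholder
sc.hadoopConfiguration.set(s"fs.swift.service.$svc.username", "user")    // placeholder
sc.hadoopConfiguration.set(s"fs.swift.service.$svc.password", "secret")  // placeholder
sc.hadoopConfiguration.set(s"fs.swift.service.$svc.public",   "true")
```

If the missing-option error comes from executors building their own RestClientBindings without these keys, setting them on the driver's Hadoop configuration before the first query might avoid the need for the warm-up write, but I have only confirmed the write-first workaround above.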