While testing Spark SQL on top of our Hive metastore, we tried to cache the
data for one partition of a table in memory like this:

CACHE TABLE xyz_20141029 AS SELECT * FROM xyz where date_prefix = 20141029
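
For context, the statement is issued through a HiveContext; a rough sketch of
the setup (simplified, Scala):

    import org.apache.spark.sql.hive.HiveContext

    // sc is the SparkContext provided by the spark-shell / our driver
    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "CACHE TABLE xyz_20141029 AS SELECT * FROM xyz WHERE date_prefix = 20141029")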

Table xyz is a Hive table partitioned by date_prefix. The data in the
date_prefix = 20141029 directory is a single Parquet file:

hdfs dfs -ls /event_logs/xyz/20141029

Found 1 items

-rw-r--r--   3 ubuntu hadoop  854521061 2014-11-11 22:20
/event_logs/xyz/20141029/part-01493178cd7f2-31eb-3f9d-b004-149a97ac4d79-r-01493.lzo.parquet
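
For reference, the table layout is roughly as sketched below; this DDL is
purely illustrative (column types, serde, and the exact partition
registration may differ from our real definition):

    // Illustrative sketch of a Hive table partitioned by date_prefix with
    // per-date external partition locations; not our exact DDL.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS xyz (tid STRING, compact STRING)
      PARTITIONED BY (date_prefix STRING)
      STORED AS PARQUET
      LOCATION '/event_logs/xyz'""")
    hiveContext.sql("""
      ALTER TABLE xyz ADD IF NOT EXISTS PARTITION (date_prefix = '20141029')
      LOCATION '/event_logs/xyz/20141029'""")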

The file is no more than 800 MB, yet the cache command takes longer than an
hour and, judging from the UI, reads many gigabytes of data, with multiple
task failures along the way.

Stage 0 (mapPartitions), which took the longest, is shown in the UI as:

Stage 0: mapPartitions at Exchange.scala:86

RDD: HiveTableScan [tid#46,compact#47,date_prefix#45], (MetastoreRelation default, bid, None), Some((CAST(date_prefix#45, DoubleType) = 2.0141029E7))

org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
org.apache.spark.sql.SchemaRDD.count(SchemaRDD.scala:343)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult$lzycompute(commands.scala:168)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult(commands.scala:159)
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
org.apache.spark.sql.execution.CacheTableCommand.execute(commands.scala:153)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:105)

2014/11/11 22:28:47, Duration: 40 min
Tasks (Succeeded/Total): 19546/19546, Input: 201.1 GB, Shuffle Write: 973.5 KB
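
For scale: 201.1 GB of input is roughly 250 times the ~0.8 GB partition file
listed above (assuming that file is all of the data under
date_prefix = 20141029).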


I need help understanding what is going on and how we can optimize the
caching.