While testing Spark SQL on top of our Hive metastore, we tried to cache the data for one partition of a table in memory like this:
CACHE TABLE xyz_20141029 AS SELECT * FROM xyz WHERE date_prefix = 20141029

Table xyz is a Hive table partitioned by date_prefix. The data in the date_prefix = 20141029 directory is a single Parquet file:

hdfs dfs -ls /event_logs/xyz/20141029
Found 1 items
-rw-r--r--  3 ubuntu hadoop  854521061 2014-11-11 22:20 /event_logs/xyz/20141029/part-01493178cd7f2-31eb-3f9d-b004-149a97ac4d79-r-01493.lzo.parquet

The file is no more than 800 MB, yet the cache command takes longer than an hour and, judging from the UI, reads many gigabytes of data, with multiple task failures along the way. Stage 0 (mapPartitions), which took the longest, shows up in the UI as:

0 mapPartitions at Exchange.scala:86 +details
RDD: HiveTableScan [tid#46,compact#47,date_prefix#45], (MetastoreRelation default, bid, None), Some((CAST(date_prefix#45, DoubleType) = 2.0141029E7))

org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
org.apache.spark.sql.SchemaRDD.count(SchemaRDD.scala:343)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult$lzycompute(commands.scala:168)
org.apache.spark.sql.execution.CacheTableCommand.sideEffectResult(commands.scala:159)
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
org.apache.spark.sql.execution.CacheTableCommand.execute(commands.scala:153)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:105)

The stage summary from the UI:

2014/11/11 22:28:47
Duration: 40 min
Tasks (succeeded/total): 19546/19546
Input: 201.1 GB
Shuffle write: 973.5 KB

I need help understanding what is going on and how we can optimize the caching.
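For reference, the programmatic equivalent of what we are doing is roughly the sketch below, written against the Spark 1.1-era HiveContext API. It is only a sketch (it needs a running SparkContext `sc` and our Hive metastore, so it cannot run standalone), and the quoted partition value is an assumption on our part; note that the plan above shows date_prefix being CAST to DoubleType when the literal is numeric.

```scala
// Sketch only: assumes a live SparkContext `sc` and a Hive metastore
// containing table xyz, partitioned by date_prefix.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Same partition selection as the CACHE TABLE statement above. Quoting the
// partition value keeps the comparison on the string column, whereas the
// numeric literal in our plan forces CAST(date_prefix, DoubleType).
val partition = hiveContext.sql(
  "SELECT * FROM xyz WHERE date_prefix = '20141029'")

partition.registerTempTable("xyz_20141029")
hiveContext.cacheTable("xyz_20141029")
```
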
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org