I'm having an issue where count(*) returns almost immediately using
Hive, but takes over 10 minutes using DataFrames. The table data is on
HDFS in uncompressed CSV format. How is it possible for Hive to get the
count so fast? Is it caching the count, or storing it in the metastore?

Is there anything I can do to optimize the performance of this using
DataFrames, or should I just run the count through Hive over JDBC?

I've tried writing this two ways:

    // 1. Run the count as SQL through the HiveContext.
    try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster",
            "Test app")) {
        final HiveContext sqlContext = new HiveContext(sc.sc());
        DataFrame df = sqlContext.sql("SELECT count(*) FROM my_table");
        df.collect();
    }

    // 2. Load the table as a DataFrame and count it.
    try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster",
            "Test app")) {
        final HiveContext sqlContext = new HiveContext(sc.sc());
        DataFrame df = sqlContext.table("my_table");
        df.count();
    }

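For the JDBC alternative mentioned above, a minimal sketch could look like the following. It assumes a reachable HiveServer2 endpoint and the hive-jdbc driver on the classpath (older driver versions may also need an explicit Class.forName("org.apache.hive.jdbc.HiveDriver")). The class name HiveCount, the helper names, and any URL or user passed in are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of Spark or Hive.
class HiveCount {

    // Builds the count query. The table name is concatenated as-is,
    // so it is assumed to be a trusted identifier.
    static String countQuery(String table) {
        return "SELECT count(*) FROM " + table;
    }

    // Runs the count through HiveServer2 over JDBC and returns the result.
    // jdbcUrl is assumed to have the form jdbc:hive2://<host>:<port>/<database>.
    static long countRows(String jdbcUrl, String user, String table)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(table))) {
            if (!rs.next()) {
                throw new SQLException("count(*) returned no rows");
            }
            return rs.getLong(1);
        }
    }
}
```

A call would then be something like HiveCount.countRows("jdbc:hive2://hive-host:10000/default", "hive", "my_table"), where the host, port, and user are placeholders.
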
Thanks.