I'm having an issue where count(*) returns almost immediately in Hive but takes over 10 minutes with DataFrames. The table data is on HDFS in uncompressed CSV format. How can Hive return the count so quickly? Is it caching the result, or is the row count stored in the metastore?

Is there anything I can do to speed this up with DataFrames, or should I just run the count through Hive over JDBC?

I've tried writing this two ways:

// Approach 1: run the count as a SQL query through HiveContext.
try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", "Test app")) {
    final HiveContext sqlContext = new HiveContext(sc.sc());
    DataFrame df = sqlContext.sql("SELECT count(*) FROM my_table");
    df.collect();
}

// Approach 2: load the table as a DataFrame and call count() on it.
try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", "Test app")) {
    final HiveContext sqlContext = new HiveContext(sc.sc());
    DataFrame df = sqlContext.table("my_table");
    df.count();
}
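
For comparison, the JDBC alternative mentioned above could look roughly like the following. This is only a minimal sketch, not something I have measured: it assumes a HiveServer2 endpoint (the URL jdbc:hive2://localhost:10000/default is a placeholder) and the Hive JDBC driver on the classpath. The class name HiveJdbcCount and the helpers countQuery and countRows are hypothetical names used for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcCount {

    // Builds the count query. The table name is checked against a simple
    // identifier pattern because it is spliced directly into the SQL text.
    static String countQuery(String table) {
        if (!table.matches("[A-Za-z_][A-Za-z0-9_.]*")) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        return "SELECT count(*) FROM " + table;
    }

    // Runs the count over an already-open JDBC connection and returns the
    // single value in the result set.
    static long countRows(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(table))) {
            if (!rs.next()) {
                throw new SQLException("count(*) returned no rows");
            }
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumption: placeholder HiveServer2 URL; pass the real one as args[0].
        String url = args.length > 0 ? args[0] : "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println(countRows(conn, "my_table"));
        }
    }
}
```

The query itself is the same one used in the first Spark snippet; the only difference is that Hive executes it instead of Spark.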

Thanks.
