count(*) performance in Hive vs Spark DataFrames

Christopher Brady Wed, 16 Dec 2015 13:19:35 -0800

I'm having an issue where count(*) returns almost immediately using Hive, but takes over 10 minutes using DataFrames. The table data is on HDFS in an uncompressed CSV format. How is it possible for Hive to get the count so fast? Is it caching this or putting it in the metastore?

Is there anything I can do to optimize the performance of this using DataFrames, or should I try doing just the count with Hive using JDBC?


I've tried writing this 2 ways:

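// Variant 1: run the aggregate as a SQL query through HiveContext.
try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", "Test app")) {
    final HiveContext sqlContext = new HiveContext(sc.sc());
    DataFrame df = sqlContext.sql("SELECT count(*) FROM my_table");
    // collect() triggers the job and returns the single result row.
    df.collect();
}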

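// Variant 2: load the table as a DataFrame and count it directly.
try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", "Test app")) {
    final HiveContext sqlContext = new HiveContext(sc.sc());
    DataFrame df = sqlContext.table("my_table");
    // count() triggers the job and returns the number of rows.
    df.count();
}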
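
For comparison, the JDBC route asked about above might look roughly like the sketch below. It is only an illustration, not a tested setup: the class name HiveJdbcCount and its helper methods are hypothetical, the connection URL is supplied on the command line (something like jdbc:hive2://localhost:10000/default would be an assumed example), and it presumes a reachable HiveServer2 with the Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: runs SELECT count(*) against Hive over plain JDBC.
final class HiveJdbcCount {

    // Builds the count query. The table name is validated because it is
    // concatenated into the SQL text (identifiers cannot be bound as parameters).
    static String countQuery(String table) {
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?")) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        return "SELECT count(*) FROM " + table;
    }

    // Executes the count over an already-open JDBC connection.
    static long count(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(table))) {
            if (!rs.next()) {
                throw new SQLException("count(*) returned no rows");
            }
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.err.println("Usage: HiveJdbcCount <jdbc-url> <table>");
            return;
        }
        // Assumption: no credentials are required; adjust for a secured cluster.
        try (Connection conn = DriverManager.getConnection(args[0], "", "")) {
            System.out.println(count(conn, args[1]));
        }
    }
}
```

This keeps the Spark job out of the picture entirely for the count, so whatever Hive does to answer the query quickly would apply here as well.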

Thanks.

