Thanks for the help!
I am actually trying to use Spark SQL to run queries against tables that
I've defined in Hive.

I follow these steps:
- I start hiveserver2 and, on the Spark side, I start Spark's Thrift server with:
$SPARK_HOME/sbin/start-thriftserver.sh --master
spark://spark-master-node-ip:7077

- and I start beeline:
$SPARK_HOME/bin/beeline

- In my beeline session, I connect to my running hiveserver2:
!connect jdbc:hive2://hive-node-ip:10000

and I can run queries successfully. But based on the hiveserver2 logs, it
seems the queries actually run via Hadoop MapReduce, *not* via Spark's
workers. My goal is to access the data in Hive's tables, but to run the
queries through Spark SQL using Spark workers (not Hadoop MapReduce).
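
My guess is that I should point beeline at Spark's Thrift server itself
rather than at hiveserver2, i.e. something like the following (assuming the
Thrift server is listening on its default port, 10000, on the master node),
but I am not sure this is the intended setup:

!connect jdbc:hive2://spark-master-node-ip:10000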

Is it possible to do that via the Spark SQL CLI, or through its Thrift
server? (I tried to find some basic examples in the documentation, but was
not able to.) Any suggestion or hint on how I can do that would be highly
appreciated.
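
For instance, I imagine the CLI route would look roughly like this (a
sketch only; I'm assuming hive-site.xml just has to be made visible to
Spark, and the Hive conf path below is a placeholder):

cp /path/to/hive/conf/hive-site.xml $SPARK_HOME/conf/
$SPARK_HOME/bin/spark-sql --master spark://spark-master-node-ip:7077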

Thnx

On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

>
>
> On 6/6/15 9:06 AM, James Pirz wrote:
>
> I am pretty new to Spark. Using Spark 1.3.1, I am trying to use Spark
> SQL to run some SQL scripts on the cluster. I realized that for better
> performance it is a good idea to use Parquet files. I have 2 questions
> regarding that:
>
>  1) If I want to use Spark SQL against *partitioned & bucketed* tables
> with Parquet format in Hive, does the provided Spark binary on the Apache
> website support that, or do I need to build a new Spark binary with some
> additional flags? (I found a note
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables>
> in the documentation about enabling Hive support, but I could not fully
> work out what the correct way of building is, if I do need to build.)
>
> Yes, Hive support is now enabled by default for the binaries on the
> website. However, Spark SQL doesn't support buckets yet.
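>
> To illustrate (the table and column names below are made up): in a Hive
> DDL like the following, Spark SQL can use the PARTITIONED BY part, but
> the CLUSTERED BY ... INTO n BUCKETS clause is the part it does not take
> advantage of:
>
> CREATE TABLE page_views (user_id BIGINT, url STRING)
>   PARTITIONED BY (dt STRING)
>   CLUSTERED BY (user_id) INTO 32 BUCKETS
>   STORED AS PARQUET;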
>
>
>  2) Does running Spark SQL against tables in Hive degrade performance?
> Is it better to load Parquet files directly into HDFS, or is having Hive
> in the picture harmless?
>
> If you're using Parquet, then it should be fine since by default Spark SQL
> uses its own native Parquet support to read Parquet Hive tables.
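>
> If you want to double-check, this conversion is controlled by the
> spark.sql.hive.convertMetastoreParquet option (true by default); you can
> inspect or toggle it from your SQL session, e.g.:
>
> SET spark.sql.hive.convertMetastoreParquet=true;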
>
>
>  Thnx
>
>
>
