Thanks Ayan, I used beeline in Spark to connect to the HiveServer2 that I
had started from Hive. So as you said, it is really interacting with Hive as
a typical 3rd-party application, and it is NOT using the Spark execution
engine. I was thinking that it gets the metastore info from Hive, but uses
Spark to execute the query.

I have already created & loaded tables in Hive, and now I want to use Spark
to run SQL queries against those tables. I just want to submit SQL queries
in Spark, against the data in Hive, without writing an application (similar
to the way one would pass SQL scripts to Hive or Shark). Going through the
Spark documentation, I realized Spark SQL is the component that I need to
use. But do you mean I have to write a client "Spark application" to do
that? Is there any way to pass SQL scripts directly through the command
line, so that Spark runs them in distributed mode on the cluster, against
the already existing data in Hive?
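
For example, what I am hoping for is something along these lines (a
hypothetical invocation; I am assuming the Spark SQL CLI can take a script
file the way the Hive CLI does):

$SPARK_HOME/bin/spark-sql --master spark://spark-master-node-ip:7077 -f my_queries.sql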

On Mon, Jun 8, 2015 at 5:53 PM, ayan guha <guha.a...@gmail.com> wrote:

> I am afraid you are going the other way around :) If you want to use Hive
> in Spark, you'd need a HiveContext with the Hive config files on the Spark
> cluster (every node). This way Spark can talk to the Hive metastore. Then
> you can write queries on Hive tables using HiveContext's sql method, and
> Spark will run them (either by reading from Hive and creating RDDs, or by
> letting Hive run the query using MR). The final result will be a Spark
> DataFrame.
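>
> For instance, a minimal sketch in spark-shell (assuming hive-site.xml is
> on every node's classpath; the table name is just an example):
>
> // sc is the SparkContext that spark-shell provides
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val df = hiveContext.sql("SELECT count(*) FROM my_hive_table")
> df.show()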
>
> What you are currently doing is using beeline to connect to Hive, which
> would work even without Spark.
>
> Best
> Ayan
>
> On Tue, Jun 9, 2015 at 10:42 AM, James Pirz <james.p...@gmail.com> wrote:
>
>> Thanks for the help!
>> I am actually trying to use Spark SQL to run queries against tables that
>> I've defined in Hive.
>>
>> I followed these steps:
>> - I start HiveServer2, and in Spark I start Spark's Thrift server with:
>> $SPARK_HOME/sbin/start-thriftserver.sh --master
>> spark://spark-master-node-ip:7077
>>
>> - and I start beeline:
>> $SPARK_HOME/bin/beeline
>>
>> - In my beeline session, I connect to my running HiveServer2:
>> !connect jdbc:hive2://hive-node-ip:10000
>>
>> and I can run queries successfully. But based on the HiveServer2 logs, it
>> seems it actually uses Hadoop's MR to run the queries, *not* Spark's
>> workers. My goal is to access the data in Hive's tables, but run the
>> queries through Spark SQL using Spark workers (not Hadoop MR).
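>>
>> In other words, I suspect I should be pointing beeline at Spark's own
>> Thrift server rather than at HiveServer2, e.g. (assuming it listens on
>> the default port 10000):
>>
>> !connect jdbc:hive2://spark-master-node-ip:10000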
>>
>> Is it possible to do that via Spark SQL (its CLI) or through its Thrift
>> server? (I tried to find some basic examples in the documentation, but
>> was not able to.) Any suggestion or hint on how I can do that would be
>> highly appreciated.
>>
>> Thnx
>>
>> On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>>
>>>
>>> On 6/6/15 9:06 AM, James Pirz wrote:
>>>
>>> I am pretty new to Spark. Using Spark 1.3.1, I am trying to use Spark
>>> SQL to run some SQL scripts on the cluster. I realized that for better
>>> performance, it is a good idea to use Parquet files. I have 2 questions
>>> regarding that:
>>>
>>>  1) If I want to use Spark SQL against *partitioned & bucketed* tables
>>> with Parquet format in Hive, does the provided Spark binary on the Apache
>>> website support that, or do I need to build a new Spark binary with some
>>> additional flags? (I found a note
>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables>
>>> in the documentation about enabling Hive support, but I could not fully
>>> work out what the correct way of building it would be, if I need to
>>> build.)
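>>>
>>> (From the building-spark page, I am guessing the build command would be
>>> something like the following, though I am not sure:
>>> mvn -Phive -Phive-thriftserver -DskipTests clean package)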
>>>
>>> Yes, Hive support is enabled by default now for the binaries on the
>>> website. However, Spark SQL doesn't support buckets yet.
>>>
>>>
>>>  2) Does running Spark SQL against tables in Hive degrade performance?
>>> Is it better to load the Parquet files directly into HDFS, or is having
>>> Hive in the picture harmless?
>>>
>>> If you're using Parquet, then it should be fine since by default Spark
>>> SQL uses its own native Parquet support to read Parquet Hive tables.
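>>>
>>> For instance, a rough sketch of both paths in spark-shell (the table
>>> name and path are just examples):
>>>
>>> // sc is the SparkContext that spark-shell provides
>>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> // query the Parquet-backed Hive table through the metastore
>>> val df1 = hiveContext.sql("SELECT * FROM my_parquet_table")
>>> // or read the Parquet files directly, bypassing the metastore
>>> val df2 = hiveContext.parquetFile("hdfs:///path/to/parquet")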
>>>
>>>
>>>  Thnx
>>>
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
