Re: Repeated data item search with Spark SQL(1.0.1)

Michael Armbrust Wed, 16 Jul 2014 12:21:18 -0700

Mostly true.  The execution of two equivalent logical plans will be exactly
the same, independent of the dialect. Resolution can be slightly different
as SQLContext defaults to case sensitive and HiveContext defaults to case
insensitive.
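
To make the resolution difference concrete, here is a minimal sketch in plain Python. It is not Spark's analyzer; the `resolve` helper and the two-column schema are hypothetical and only illustrate how the same column reference can resolve under one default and fail under the other.

```python
# Hypothetical sketch of column resolution; this is NOT Spark's analyzer code.
def resolve(column, schema, case_sensitive):
    """Return the schema column matching `column`, or None if unresolved."""
    for candidate in schema:
        if case_sensitive:
            if candidate == column:
                return candidate
        elif candidate.lower() == column.lower():
            return candidate
    return None

# Made-up schema for illustration.
schema = ["name", "schools"]

# Case-sensitive lookup (the SQLContext default described above).
strict = resolve("Name", schema, case_sensitive=True)

# Case-insensitive lookup (the HiveContext default described above).
relaxed = resolve("Name", schema, case_sensitive=False)

print(strict)
print(relaxed)
```

Under these assumptions, a query that refers to `Name` would fail to resolve in the case-sensitive setting but resolve to `name` in the case-insensitive one, even though both would execute identically once resolved.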


One other very technical detail: The actual planning done by HiveContext
and SQLContext is slightly different, as SQLContext does not have
strategies for reading data from Hive tables. All other operators should be
the same, though.  This difference has nothing to do with the dialect.

On Wed, Jul 16, 2014 at 2:13 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Michael,
>
> Thank you for the explanation. Can you validate whether the following
> statement is true/incomplete/false:
> "hql uses Hive to parse and to construct the logical plan whereas sql is
> pure spark implementation of parsing and logical plan construction. Once
> spark obtains the logical plan, it is executed in spark regardless of
> dialect although the execution might be different for the same query."
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> hql and sql are just two different dialects for interacting with data.
>>  After parsing is complete and the logical plan is constructed, the
>> execution is exactly the same.
>>
>>
>> On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> I don't understand the difference between hql (HiveContext) and sql
>>> (SQLContext). My previous understanding was that hql is hive specific.
>>> Unless the table is managed by Hive, we should use sql. For instance, RDD
>>> (hdfsRDD) created from files in HDFS and registered as a table should use
>>> sql.
>>>
>>> However, my current understanding after trying your suggestion above is
>>> that I can also query the hdfsRDD using hql via LocalHiveContext. I just
>>> tested it, and the lateral view explode(schools) works with the hdfsRDD.
>>>
>>> It seems to me that HiveContext and SQLContext are the same except
>>> that HiveContext needs a metastore and has more powerful SQL support
>>> borrowed from Hive. Can you shed some light on this when you get a minute?
>>>
>>> Thanks,
>>>
>>> Jerry
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> No, that is why I included the link to SPARK-2096
>>>> <https://issues.apache.org/jira/browse/SPARK-2096> as well.  You'll
>>>> need to use HiveQL at this time.
>>>>
>>>>> Is it possible or planned to support the "schools.time" format to filter
>>>>> the records in which an element of the schools array satisfies time > 2?
>>>>
>>>> It would be great to support something like this, but it's going to take
>>>> a while to hammer out the correct semantics, as SQL does not in general have
>>>> great support for nested structures.  I think different people might
>>>> interpret that query to mean there is SOME school.time > 2 vs. ALL
>>>> school.time > 2, etc.
>>>>
>>>> You can get what you want now using a lateral view:
>>>>
>>>> hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s
>>>> as school WHERE school.time > 2")
>>>>
>>>
>>>
>>
>
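
The SOME vs. ALL ambiguity mentioned in the quoted thread, and what the lateral view query selects, can be sketched in plain Python without Spark. The sample rows and the `lateral_view_explode` helper are made up for illustration; only the field names (`name`, `schools`, `time`) come from the query in the thread.

```python
# Made-up sample rows; field names follow the query quoted in the thread.
people = [
    {"name": "alice", "schools": [{"time": 1}, {"time": 3}]},
    {"name": "bob",   "schools": [{"time": 1}, {"time": 2}]},
    {"name": "carol", "schools": [{"time": 4}, {"time": 5}]},
]

def lateral_view_explode(rows):
    """Yield one (name, school) pair per array element, mimicking
    LATERAL VIEW explode(schools) s AS school."""
    for row in rows:
        for school in row["schools"]:
            yield row["name"], school

# SELECT DISTINCT name ... WHERE school.time > 2
# A name is kept if SOME element of its schools array matches.
some_names = sorted({name for name, school in lateral_view_explode(people)
                     if school["time"] > 2})

# The ALL reading, for contrast: every element must match.
all_names = sorted(row["name"] for row in people
                   if all(s["time"] > 2 for s in row["schools"]))

print(some_names)  # ['alice', 'carol']
print(all_names)   # ['carol']
```

In this sketch the exploded rows are filtered one element at a time, so the lateral view query corresponds to the SOME reading; a hypothetical "schools.time > 2" filter would have to pick one of the two readings explicitly.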
