Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Michael Armbrust
Mostly true.  The execution of two equivalent logical plans will be exactly
the same, independent of the dialect. Resolution can be slightly different,
as SQLContext is case sensitive by default and HiveContext is case
insensitive by default.
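
For example (a minimal sketch; the table and column names here are
hypothetical, just to illustrate the resolution difference):

// Hypothetical table "t" registered with a lowercase column "col".
// SQLContext is case sensitive, so "COL" will not resolve to "col";
// analysis fails with an unresolved attribute error:
//   sqlContext.sql("SELECT COL FROM t")
// HiveContext is case insensitive, so the same query resolves COL to col:
//   hiveContext.hql("SELECT COL FROM t")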

One other very technical detail: the actual planning done by HiveContext
and SQLContext is slightly different, as SQLContext does not have
strategies for reading data from Hive tables. All other operators should be
the same, though. This difference, however, has nothing to do with the
dialect.

On Wed, Jul 16, 2014 at 2:13 PM, Jerry Lam  wrote:

> Hi Michael,
>
> Thank you for the explanation. Can you validate whether the following
> statement is true, incomplete, or false:
> "hql uses Hive to parse and construct the logical plan, whereas sql is a
> pure Spark implementation of parsing and logical plan construction. Once
> Spark obtains the logical plan, it is executed in Spark regardless of
> dialect, although the execution might be different for the same query."
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust 
> wrote:
>
>> hql and sql are just two different dialects for interacting with data.
>>  After parsing is complete and the logical plan is constructed, the
>> execution is exactly the same.
>>
>>
>> On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam  wrote:
>>
>>> Hi Michael,
>>>
>>> I don't understand the difference between hql (HiveContext) and sql
>>> (SQLContext). My previous understanding was that hql is Hive-specific:
>>> unless the table is managed by Hive, we should use sql. For instance, an
>>> RDD (hdfsRDD) created from files in HDFS and registered as a table should
>>> be queried with sql.
>>>
>>> However, my current understanding after trying your suggestion above is
>>> that I can also query the hdfsRDD using hql via LocalHiveContext. I just
>>> tested it; the lateral view explode(schools) works with the hdfsRDD.
>>>
>>> It seems to me that HiveContext and SQLContext are the same, except that
>>> HiveContext needs a metastore and has more powerful SQL support borrowed
>>> from Hive. Can you shed some light on this when you get a minute?
>>>
>>> Thanks,
>>>
>>> Jerry
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> No, that is why I included the link to SPARK-2096 as well.  You'll
>>>> need to use HiveQL at this time.
>>>>
>>>>> Is it possible or planned to support the "schools.time" format to filter
>>>>> the record where there is an element inside the array of schools that
>>>>> satisfies time > 2?
>>>>
>>>> It would be great to support something like this, but it's going to take
>>>> a while to hammer out the correct semantics, as SQL does not in general
>>>> have great support for nested structures.  I think different people might
>>>> interpret that query to mean there is SOME school.time > 2 vs. ALL
>>>> school.time > 2, etc.
>>>>
>>>> You can get what you want now using a lateral view:
>>>>
>>>> hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s
>>>> as school WHERE school.time > 2")
>>>>
>>>
>>>
>>
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Jerry Lam
Hi Michael,

Thank you for the explanation. Can you validate whether the following
statement is true, incomplete, or false:
"hql uses Hive to parse and construct the logical plan, whereas sql is a
pure Spark implementation of parsing and logical plan construction. Once
Spark obtains the logical plan, it is executed in Spark regardless of
dialect, although the execution might be different for the same query."

Best Regards,

Jerry


On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust 
wrote:

> hql and sql are just two different dialects for interacting with data.
>  After parsing is complete and the logical plan is constructed, the
> execution is exactly the same.
>
>
> On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam  wrote:
>
>> Hi Michael,
>>
>> I don't understand the difference between hql (HiveContext) and sql
>> (SQLContext). My previous understanding was that hql is Hive-specific:
>> unless the table is managed by Hive, we should use sql. For instance, an
>> RDD (hdfsRDD) created from files in HDFS and registered as a table should
>> be queried with sql.
>>
>> However, my current understanding after trying your suggestion above is
>> that I can also query the hdfsRDD using hql via LocalHiveContext. I just
>> tested it; the lateral view explode(schools) works with the hdfsRDD.
>>
>> It seems to me that HiveContext and SQLContext are the same, except that
>> HiveContext needs a metastore and has more powerful SQL support borrowed
>> from Hive. Can you shed some light on this when you get a minute?
>>
>> Thanks,
>>
>> Jerry
>>
>>
>>
>>
>>
>> On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust  wrote:
>>
>>> No, that is why I included the link to SPARK-2096 as well.  You'll
>>> need to use HiveQL at this time.
>>>
>>>> Is it possible or planned to support the "schools.time" format to filter
>>>> the record where there is an element inside the array of schools that
>>>> satisfies time > 2?
>>>
>>> It would be great to support something like this, but it's going to take
>>> a while to hammer out the correct semantics, as SQL does not in general
>>> have great support for nested structures.  I think different people might
>>> interpret that query to mean there is SOME school.time > 2 vs. ALL
>>> school.time > 2, etc.
>>>
>>> You can get what you want now using a lateral view:
>>>
>>> hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s as
>>> school WHERE school.time > 2")
>>>
>>
>>
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
hql and sql are just two different dialects for interacting with data.
 After parsing is complete and the logical plan is constructed, the
execution is exactly the same.
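
(A minimal sketch of the two entry points; the contexts and the people
table are the ones from anyweil's code further down this thread:)

// Same data, two dialects; after parsing, execution is identical.
sqlContext.sql("SELECT name FROM people")    // Catalyst's plain SQL parser
hiveContext.hql("SELECT name FROM people")   // HiveQL parser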


On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam  wrote:

> Hi Michael,
>
> I don't understand the difference between hql (HiveContext) and sql
> (SQLContext). My previous understanding was that hql is Hive-specific:
> unless the table is managed by Hive, we should use sql. For instance, an
> RDD (hdfsRDD) created from files in HDFS and registered as a table should
> be queried with sql.
>
> However, my current understanding after trying your suggestion above is
> that I can also query the hdfsRDD using hql via LocalHiveContext. I just
> tested it; the lateral view explode(schools) works with the hdfsRDD.
>
> It seems to me that HiveContext and SQLContext are the same, except that
> HiveContext needs a metastore and has more powerful SQL support borrowed
> from Hive. Can you shed some light on this when you get a minute?
>
> Thanks,
>
> Jerry
>
>
>
>
>
> On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust 
> wrote:
>
>> No, that is why I included the link to SPARK-2096 as well.  You'll need
>> to use HiveQL at this time.
>>
>>> Is it possible or planned to support the "schools.time" format to filter
>>> the record where there is an element inside the array of schools that
>>> satisfies time > 2?
>>
>> It would be great to support something like this, but it's going to take a
>> while to hammer out the correct semantics, as SQL does not in general have
>> great support for nested structures.  I think different people might
>> interpret that query to mean there is SOME school.time > 2 vs. ALL
>> school.time > 2, etc.
>>
>> You can get what you want now using a lateral view:
>>
>> hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s as
>> school WHERE school.time > 2")
>>
>
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi Michael,

I don't understand the difference between hql (HiveContext) and sql
(SQLContext). My previous understanding was that hql is Hive-specific:
unless the table is managed by Hive, we should use sql. For instance, an
RDD (hdfsRDD) created from files in HDFS and registered as a table should
be queried with sql.

However, my current understanding after trying your suggestion above is
that I can also query the hdfsRDD using hql via LocalHiveContext. I just
tested it; the lateral view explode(schools) works with the hdfsRDD.

It seems to me that HiveContext and SQLContext are the same, except that
HiveContext needs a metastore and has more powerful SQL support borrowed
from Hive. Can you shed some light on this when you get a minute?

Thanks,

Jerry





On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust 
wrote:

> No, that is why I included the link to SPARK-2096 as well.  You'll need
> to use HiveQL at this time.
>
>> Is it possible or planned to support the "schools.time" format to filter the
>> record where there is an element inside the array of schools that satisfies
>> time > 2?
>
> It would be great to support something like this, but it's going to take a
> while to hammer out the correct semantics, as SQL does not in general have
> great support for nested structures.  I think different people might
> interpret that query to mean there is SOME school.time > 2 vs. ALL
> school.time > 2, etc.
>
> You can get what you want now using a lateral view:
>
> hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s as
> school WHERE school.time > 2")
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
No, that is why I included the link to SPARK-2096 as well.  You'll need to
use HiveQL at this time.

> Is it possible or planned to support the "schools.time" format to filter the
> record where there is an element inside the array of schools that satisfies
> time > 2?

It would be great to support something like this, but it's going to take a
while to hammer out the correct semantics, as SQL does not in general have
great support for nested structures.  I think different people might
interpret that query to mean there is SOME school.time > 2 vs. ALL
school.time > 2, etc.

You can get what you want now using a lateral view:

hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s as
school WHERE school.time > 2")
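
For reference, a minimal end-to-end sketch of the above (assuming a
LocalHiveContext and the sample people.json from this thread; treat it as
illustrative, not as the only way to set this up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LateralView").setMaster("local"))
val hiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)
hiveContext.jsonFile("./data/people.json").registerAsTable("people")

// explode(schools) emits one row per array element, so the WHERE clause is
// evaluated per school; DISTINCT collapses people matched by several schools.
hiveContext.hql(
  "SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s " +
  "as school WHERE school.time > 2").collect().foreach(println)
// With the sample data only Michael has schools (times 1994 and 2000),
// so this prints [Michael].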


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi guys,

Sorry, I'm also interested in this nested JSON structure.
I have a similar SQL query in which I need to query a nested field in a
JSON document. Does the above query work if it is used with sql(sqlText),
assuming the data is coming directly from HDFS via sqlContext.jsonFile?

SPARK-2483 seems to address only HiveQL.

Best Regards,

Jerry



On Tue, Jul 15, 2014 at 3:38 AM, anyweil  wrote:

> Thank you so much for the information. Now I have merged the fix of #1411,
> and HiveQL seems to work with:
> SELECT name FROM people WHERE schools[0].time > 2.
>
> But one more question is:
>
> Is it possible or planned to support the "schools.time" format to filter
> the record where there is an element inside the array of schools that
> satisfies time > 2?
>
> The above requirement should be more general than schools[0].time > 2, as
> we sometimes don't know which element in the array will satisfy the
> condition (we do not know whether we should use 0, 1, or X in
> schools[X].time); we only care whether there is one that satisfies the
> condition. Thank you!
>
>
>
>
>
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread anyweil
Thank you so much for the information. Now I have merged the fix of #1411,
and HiveQL seems to work with:
SELECT name FROM people WHERE schools[0].time > 2.

But one more question is:

Is it possible or planned to support the "schools.time" format to filter the
record where there is an element inside the array of schools that satisfies
time > 2?

The above requirement should be more general than schools[0].time > 2, as
we sometimes don't know which element in the array will satisfy the
condition (we do not know whether we should use 0, 1, or X in
schools[X].time); we only care whether there is one that satisfies the
condition. Thank you!







Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Sorry for the trouble.  There are two issues here:
 - Parsing of repeated nested fields (i.e. something[0].field) is not
supported in the plain SQL parser. SPARK-2096
 - Resolution is broken in the HiveQL parser. SPARK-2483

The latter issue is fixed now: #1411
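
For concreteness, a minimal sketch of the difference against the people
table from this thread (the error text is taken from the trace in anyweil's
message; the HiveQL variant assumes a build that includes #1411, as anyweil
tried):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Nested").setMaster("local"))

// Plain SQL parser (SPARK-2096): the bracket syntax does not parse.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.jsonFile("./data/people.json").registerAsTable("people")
// sqlContext.sql("SELECT name FROM people WHERE schools[0].time > 2")
//   => java.lang.RuntimeException: [1.41] failure:
//      ``UNION'' expected but identifier .time found

// HiveQL parses it, and with the SPARK-2483 fix it also resolves:
val hiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)
hiveContext.jsonFile("./data/people.json").registerAsTable("people")
hiveContext.hql("SELECT name FROM people WHERE schools[0].time > 2")
  .collect().foreach(println)
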
Michael


On Mon, Jul 14, 2014 at 11:38 PM, anyweil  wrote:

> Thank you so much for the reply; here is my code.
>
> 1.   val conf = new SparkConf().setAppName("Simple Application")
> 2.   conf.setMaster("local")
> 3.   val sc = new SparkContext(conf)
> 4.   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> 5.   import sqlContext.createSchemaRDD
> 6.   val path1 = "./data/people.json"
> 7.   val people = sqlContext.jsonFile(path1)
> 8.   people.registerAsTable("people")
> 9.   var sql="SELECT name FROM people WHERE schools.time>2"
> 10. val result = sqlContext.sql(sql)
> 11. result.collect().foreach(println)
>
> the content of people.json is:
> {"name":"Michael",
> "schools":[{"name":"ABC","time":1994},{"name":"EFG","time":2000}]}
> {"name":"Andy", "age":30,"scores":{"eng":98,"phy":89}}
> {"name":"Justin", "age":19}
>
> What I have tried is:
> *1. Use HiveQL:*
> I have tried to replace:
> line 4 with
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> line 10 with
> val result = sqlContext.hql(sql)
> (I have recompiled the Spark jar with Hive support), but it seems I got the
> same error.
>
> *2. use []. for the access:*
> I have tried to replace:
> line 9 with:
> var sql="SELECT name FROM people WHERE schools[0].time>2", but got the
> error:
>
> 14/07/15 14:37:49 INFO SparkContext: Job finished: reduce at
> JsonRDD.scala:40, took 0.98412 s
> Exception in thread "main" java.lang.RuntimeException: [1.41] failure:
> ``UNION'' expected but identifier .time found
>
> SELECT name FROM people WHERE schools[0].time>2
> ^
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
> at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:185)
> at SimpleApp$.main(SimpleApp.scala:32)
> at SimpleApp.main(SimpleApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
>
> It seems this is not supported.
>
>
>
>
>
>
>
>
>


Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread anyweil
Thank you so much for the reply; here is my code.

1.   val conf = new SparkConf().setAppName("Simple Application")
2.   conf.setMaster("local")
3.   val sc = new SparkContext(conf)
4.   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
5.   import sqlContext.createSchemaRDD
6.   val path1 = "./data/people.json"
7.   val people = sqlContext.jsonFile(path1)
8.   people.registerAsTable("people")
9.   var sql="SELECT name FROM people WHERE schools.time>2"
10. val result = sqlContext.sql(sql)
11. result.collect().foreach(println)

the content of people.json is:
{"name":"Michael",
"schools":[{"name":"ABC","time":1994},{"name":"EFG","time":2000}]}
{"name":"Andy", "age":30,"scores":{"eng":98,"phy":89}}
{"name":"Justin", "age":19}

What I have tried is:
*1. Use HiveQL:*
I have tried to replace:
line 4 with 
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
line 10 with
val result = sqlContext.hql(sql)
(I have recompiled the Spark jar with Hive support), but it seems I got the
same error.

*2. use []. for the access:*
I have tried to replace:
line 9 with:
var sql="SELECT name FROM people WHERE schools[0].time>2", but got the
error:

14/07/15 14:37:49 INFO SparkContext: Job finished: reduce at
JsonRDD.scala:40, took 0.98412 s
Exception in thread "main" java.lang.RuntimeException: [1.41] failure:
``UNION'' expected but identifier .time found

SELECT name FROM people WHERE schools[0].time>2
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:185)
at SimpleApp$.main(SimpleApp.scala:32)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

It seems this is not supported.










Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Handling of complex types is somewhat limited in SQL at the moment.  It'll
be more complete if you use HiveQL.

That said, the problem here is that you are calling .name on an array.  You
need to pick an item from the array (using [..]) or use something like a
lateral view explode.
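
For example, a sketch of the two forms against your people table (as noted
above, dialect support for these varies, so treat this as illustrative):

// Pick a specific element by index (the [..] form):
hql("SELECT name FROM people WHERE schools[0].name = 'ABC'")

// Or flatten the array so each element can be tested (matches any element):
hql("SELECT DISTINCT name FROM people LATERAL VIEW explode(schools) s " +
    "as school WHERE school.name = 'ABC'")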


On Sat, Jul 12, 2014 at 11:16 PM, anyweil  wrote:

> Hi All:
>
> I am using Spark SQL 1.0.1 for a simple test. The loaded data (JSON
> format), which is registered as table "people", is:
>
> {"name":"Michael",
> "schools":[{"name":"ABC","time":1994},{"name":"EFG","time":2000}]}
> {"name":"Andy", "age":30,"scores":{"eng":98,"phy":89}}
> {"name":"Justin", "age":19}
>
> the schools field has repeated values {"name":"XXX","time":X}; how should I
> write the SQL to select the people who have a school with name "ABC"? I
> have tried "SELECT name FROM people WHERE schools.name = 'ABC'", but it
> seems wrong, failing with:
>
> [error] (run-main-0)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: 'name, tree:
> [error] Project ['name]
> [error]  Filter ('schools.name = ABC)
> [error]   Subquery people
> [error]ParquetRelation people.parquet, Some(Configuration:
> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: 'name, tree:
>
> Project ['name]
>  Filter ('schools.name = ABC)
>   Subquery people
>ParquetRelation people.parquet, Some(Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, mapred-site.xml)
>
> at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:71)
> ...
>
> Could anybody show me how to write a correct SQL query for this repeated
> data item search in Spark SQL? Thank you!
>
>
>
>
>
>