Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as
'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)'

The error information is as follows:
py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY, sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)
                                                                    ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)


I have the impression that Spark SQL doesn't support more than two joins.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15847.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together:

SELECT * 
FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)

py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
several joins together.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext 
implements a smaller subset of the SQL language and so you're getting a SQL 
parse error because it doesn't support the syntax you have. Look at how you'd 
write this in HiveQL, and then try doing that with HiveContext.
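
For example, a minimal PySpark sketch of that approach (a sketch only: it
assumes the sales, magasin, and eans tables from this thread are already
registered against the context, and uses the Spark 1.1-era API):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="MultiJoinExample")
    # HiveContext parses queries with HiveQL, which accepts chained
    # INNER JOINs that the plain SQLContext parser rejects here.
    hc = HiveContext(sc)

    result = hc.sql("""
        SELECT *
        FROM sales
        INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
        INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY
                            AND magasin.FORM_KEY = eans.FORM_KEY)
    """)
    print(result.take(5))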

On Oct 7, 2014, at 7:20 AM, Gen gen.tan...@gmail.com wrote:

 Hi, in fact, the same problem happens when I try several joins together:
 
 SELECT * 
 FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
 INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
 eans.FORM_KEY)
 
 py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
 : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
 `INNER' found
 
 SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
 STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
 Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
 JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
 (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)
 
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
 I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
 several joins together.
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: Re: Spark SQL -- more than two tables for join

2014-09-11 Thread Yin Huai
1.0.1 does not have support for outer joins (added in 1.1). Can you try the
1.1 branch?
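
For reference, a minimal PySpark sketch of that query against a 1.1 build (a
sketch only: it assumes the youhao_* tables are registered as temporary
tables on the same context):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="OuterJoinExample")
    sqlCtx = SQLContext(sc)

    # On 1.0.1 this fails with "``UNION'' expected but `left' found";
    # per the note above, outer join support was added in 1.1.
    result = sqlCtx.sql(
        "SELECT * FROM youhao_data "
        "LEFT JOIN youhao_age "
        "ON (youhao_data.rowkey = youhao_age.rowkey) "
        "LEFT JOIN youhao_totalKiloMeter "
        "ON (youhao_age.rowkey = youhao_totalKiloMeter.rowkey)")
    print(result.count())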

On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com
wrote:

 Hi, Michael:

 I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now, but I can
 show something:
 1) My Spark version is 1.0.1.
 2) When I use multiple joins, like this:
 sql("SELECT * FROM youhao_data left join youhao_age on
 (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on
 (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)")

 youhao_data, youhao_age, and youhao_totalKiloMeter were registered via
 registerAsTable.

 I get the exception:
 Exception in thread "main" java.lang.RuntimeException: [1.90] failure:
 ``UNION'' expected but `left' found

 SELECT * FROM youhao_data left join youhao_age on
 (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on
 (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)

 ^
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
 at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181)
 at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140)
 at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala)
 --
  boyingk...@163.com

  *From:* Michael Armbrust mich...@databricks.com
 *Date:* 2014-09-11 00:28
 *To:* arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com
 *CC:* arunshell87 shell.a...@gmail.com; u...@spark.incubator.apache.org
 *Subject:* Re: Spark SQL -- more than two tables for join
What version of Spark SQL are you running here?  I think a lot of your
 concerns have likely been addressed in more recent versions of the code /
 documentation.  (Spark 1.1 should be published in the next few days)

 In particular, for serious applications you should use a HiveContext and
 HiveQL, as this is a much more complete implementation of a SQL parser.  The
 one in SQLContext is only suggested if the Hive dependencies conflict with
 your application.


 1)  spark sql does not support multiple join


 This is not true.  What problem were you running into?


 2)  spark left join: has performance issue


 Can you describe your data and query more?


 3)  spark sql’s cache table: does not support two-tier query


 I'm not sure what you mean here.


 4)  spark sql does not support repartition


 You can repartition SchemaRDDs in the same way as normal RDDs.



Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arunshell87

Hi,

I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did
not work in Spark SQL.

I did not find any documentation on what queries work and what do not work
in Spark SQL; maybe we have to wait for the Spark book to be released in
Feb 2015.

I believe you can try HiveQL in Spark for your requirement.

Thanks,
Arun



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi,

Maybe you can take a look at the following.

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

Good luck.
Arthur

On 10 Sep, 2014, at 9:09 pm, arunshell87 shell.a...@gmail.com wrote:

 
 Hi,
 
 I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did
 not work in Spark SQL.
 
 I did not find any documentation on what queries work and what do not work
 in Spark SQL; maybe we have to wait for the Spark book to be released in
 Feb 2015.
 
 I believe you can try HiveQL in Spark for your requirement.
 
 Thanks,
 Arun
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: Spark SQL -- more than two tables for join

2014-09-10 Thread Michael Armbrust
What version of Spark SQL are you running here?  I think a lot of your
concerns have likely been addressed in more recent versions of the code /
documentation.  (Spark 1.1 should be published in the next few days)

In particular, for serious applications you should use a HiveContext and
HiveQL, as this is a much more complete implementation of a SQL parser.  The
one in SQLContext is only suggested if the Hive dependencies conflict with
your application.


 1)  spark sql does not support multiple join


This is not true.  What problem were you running into?


 2)  spark left join: has performance issue


Can you describe your data and query more?


 3)  spark sql’s cache table: does not support two-tier query


I'm not sure what you mean here.


 4)  spark sql does not support repartition


You can repartition SchemaRDDs in the same way as normal RDDs.
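
For example, a minimal PySpark sketch (assuming the hc HiveContext and the
registered sales table from the examples earlier in the thread):

    # A SchemaRDD is an RDD of Row objects, so ordinary RDD methods apply.
    sales = hc.sql("SELECT * FROM sales")

    # repartition() shuffles the rows into 16 partitions, exactly as it
    # would on a normal RDD.
    sales16 = sales.repartition(16)
    print(sales16.count())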


Re: Re: Spark SQL -- more than two tables for join

2014-09-10 Thread boyingk...@163.com
Hi, Michael:

I think Arthur.hk.chan isn't here now, but I can show something:
1) My Spark version is 1.0.1.
2) When I use multiple joins, like this:
sql("SELECT * FROM youhao_data left join youhao_age on
(youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on
(youhao_age.rowkey=youhao_totalKiloMeter.rowkey)")

youhao_data, youhao_age, and youhao_totalKiloMeter were registered via
registerAsTable.

I get the exception:
Exception in thread "main" java.lang.RuntimeException: [1.90] failure:
``UNION'' expected but `left' found

SELECT * FROM youhao_data left join youhao_age on 
(youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on 
(youhao_age.rowkey=youhao_totalKiloMeter.rowkey)

 ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala)



boyingk...@163.com

From: Michael Armbrust
Date: 2014-09-11 00:28
To: arthur.hk.c...@gmail.com
CC: arunshell87; u...@spark.incubator.apache.org
Subject: Re: Spark SQL -- more than two tables for join
What version of Spark SQL are you running here?  I think a lot of your concerns 
have likely been addressed in more recent versions of the code / documentation. 
 (Spark 1.1 should be published in the next few days)


In particular, for serious applications you should use a HiveContext and
HiveQL, as this is a much more complete implementation of a SQL parser.  The
one in SQLContext is only suggested if the Hive dependencies conflict with
your application.

1)  spark sql does not support multiple join

This is not true.  What problem were you running into?

2)  spark left join: has performance issue

Can you describe your data and query more?

3)  spark sql’s cache table: does not support two-tier query

I'm not sure what you mean here.

4)  spark sql does not support repartition

You can repartition SchemaRDDs in the same way as normal RDDs.