Re: Spark SQL -- more than two tables for join
Hi, the same problem happens when I try several joins together, such as 'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)' The error information is as follow: py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql. : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but `INNER' fo und SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS STO_KEY, sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS Total, magasin.F ORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER JOIN magasin ON sale s.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY a nd magasin.FORM_KEY = eans.FORM_KEY) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl. java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces sorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) I have an impression that sparksql doesn't support more than two joins -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15847.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql. : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but `INNER' found SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) I use spark 1.1.0, so I have an impression that sparksql doesn't support several joins together. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext. On Oct 7, 2014, at 7:20 AM, Gen gen.tan...@gmail.com wrote: Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql. : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but `INNER' found SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) I use spark 1.1.0, so I have an impression that sparksql doesn't support several joins together. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Re: Spark SQL -- more than two tables for join
1.0.1 does not have the support on outer joins (added in 1.1). Can you try 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com wrote: Hi,michael : I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now,I Can Show something: 1)my spark version is 1.0.1 2) when I use multiple join ,like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)) youhao_data,youhao_age,youhao_totalKiloMeter were registerAsTable 。 I take the Exception: Exception in thread main java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181) at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140) at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala) -- boyingk...@163.com *From:* Michael Armbrust mich...@databricks.com *Date:* 2014-09-11 00:28 *To:* arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com *CC:* arunshell87 shell.a...@gmail.com; u...@spark.incubator.apache.org *Subject:* Re: Spark SQL -- more than two tables for join What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser. The one in SQL context is only suggested if the Hive dependencies conflict with your application. 1) spark sql does not support multiple join This is not true. What problem were you running into? 2) spark left join: has performance issue Can you describe your data and query more? 3) spark sql’s cache table: does not support two-tier query I'm not sure what you mean here. 4) spark sql does not support repartition You can repartition SchemaRDDs in the same way as normal RDDs.
Re: Spark SQL -- more than two tables for join
Hi, I too had tried SQL queries with joins, MINUS , subqueries etc but they did not work in Spark Sql. I did not find any documentation on what queries work and what do not work in Spark SQL, may be we have to wait for the Spark book to be released in Feb-2015. I believe you can try HiveQL in Spark for your requirement. Thanks, Arun -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
Hi, May be you can take a look about the following. http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Good luck. Arthur On 10 Sep, 2014, at 9:09 pm, arunshell87 shell.a...@gmail.com wrote: Hi, I too had tried SQL queries with joins, MINUS , subqueries etc but they did not work in Spark Sql. I did not find any documentation on what queries work and what do not work in Spark SQL, may be we have to wait for the Spark book to be released in Feb-2015. I believe you can try HiveQL in Spark for your requirement. Thanks, Arun -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser. The one in SQL context is only suggested if the Hive dependencies conflict with your application. 1) spark sql does not support multiple join This is not true. What problem were you running into? 2) spark left join: has performance issue Can you describe your data and query more? 3) spark sql’s cache table: does not support two-tier query I'm not sure what you mean here. 4) spark sql does not support repartition You can repartition SchemaRDDs in the same way as normal RDDs.
Re: Re: Spark SQL -- more than two tables for join
Hi,michael : I think Arthur.hk.chan isn't here now,I Can Show something: 1)my spark version is 1.0.1 2) when I use multiple join ,like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)) youhao_data,youhao_age,youhao_totalKiloMeter were registerAsTable 。 I take the Exception: Exception in thread main java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181) at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140) at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala) boyingk...@163.com From: Michael Armbrust Date: 2014-09-11 00:28 To: arthur.hk.c...@gmail.com CC: arunshell87; u...@spark.incubator.apache.org Subject: Re: Spark SQL -- more than two tables for join What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser. The one in SQL context is only suggested if the Hive dependencies conflict with your application. 1) spark sql does not support multiple join This is not true. What problem were you running into? 2) spark left join: has performance issue Can you describe your data and query more? 3) spark sql’s cache table: does not support two-tier query I'm not sure what you mean here. 4) spark sql does not support repartition You can repartition SchemaRDDs in the same way as normal RDDs.