[Spark SQL] Can explode array of structs in correlated subquery

2018-06-17 Thread bobotu
I'm not sure how to describe this scenario in words, so let's look at some example SQL. Given the table schema:

  create table customer (
    c_custkey bigint,
    c_name string,
    c_orders array<struct<...>>
  )

Now I want to know each customer's `avg(o_totalprice)`. Maybe I can use
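A hedged sketch of one way to get this without a correlated subquery, using LATERAL VIEW explode (only o_totalprice is taken from the question; the rest of the struct layout is assumed):

  // assumes an active SparkSession `spark` with the customer table registered
  val avgPerCustomer = spark.sql("""
    SELECT c_custkey, avg(o.o_totalprice) AS avg_totalprice
    FROM customer
    LATERAL VIEW explode(c_orders) exploded AS o
    GROUP BY c_custkey
  """)

Each array element becomes one row aliased o, so the aggregate runs per customer over the exploded orders.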

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its SparkContext. I see you are still passing “yarn-client” as an argument later to it in your command.
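A minimal Scala sketch of what Matei describes (the real retail_db.GetRevenuePerOrder source is not shown in the thread, so this shape is an assumption): if the application wires a program argument into setMaster, it silently overrides whatever --master was passed to spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  object GetRevenuePerOrder {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("GetRevenuePerOrder")
        .setMaster(args(0)) // args(0) is "yarn-client" in the command above, so it wins
      val sc = new SparkContext(conf)
      // ... job logic elided ...
      sc.stop()
    }
  }

Leaving setMaster out entirely lets spark-submit's --master flag decide.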

[Spark-sql Dataset] .as[SomeClass] not modifying Physical Plan

2018-06-17 Thread Daniel Pires
Hi everyone, I am trying to understand the behaviour of .as[SomeClass] (Dataset API). Say I have a file with Users:

  case class User(id: Int, name: String, address: String, date_add: java.sql.Date)
  val users = sc.parallelize(Stream.fill(100)(User(0, "test", "Test Street", new java.sql.Date(0,
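A hedged sketch (continuing the User class above; the parquet path is hypothetical) of why the physical plan looks unchanged: .as[User] only attaches an encoder lazily, and the typed deserialization shows up in the plan only once a typed operation forces it:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val df = spark.read.parquet("/tmp/users.parquet") // hypothetical path
  df.as[User].explain()             // same physical plan as df.explain()
  df.as[User].map(_.name).explain() // map forces a DeserializeToObject step into User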

making query state checkpoint compatible in structured streaming

2018-06-17 Thread puneetloya
Consider a Spark query (A) that depends on Kafka topics t1 and t2. After running this query in streaming mode, a checkpoint (C1) directory for the query gets created with offsets and sources subdirectories. Now I add a third topic (t3) on which the query depends. Now if I
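A hedged sketch of the setup described (broker address, sink, and paths are placeholders; an active SparkSession `spark` is assumed), with the change in question living in the subscribe option:

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
    .option("subscribe", "t1,t2")                        // later changed to "t1,t2,t3"
    .load()

  stream.writeStream
    .format("parquet")                                   // placeholder sink
    .option("path", "/data/out")
    .option("checkpointLocation", "/data/checkpoints/C1")
    .start()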

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Thank you Subhash. Here is the new command:

  spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client /public/retail_db/order_items /home/rxie/output/revenueperorder

Still seeing the same issue here.

  2018-06-17 11:51:25

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Subhash Sriram
Hi Raymond, If you set your master to local[*] instead of yarn-client, it should run on your local machine. Thanks, Subhash
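A hedged sketch of the resulting invocation (jar, class, and paths copied from Raymond's follow-up elsewhere in the thread; per Matei's reply, the trailing "yarn-client" program argument likely also needs to become local[*] if the job feeds args(0) into its SparkContext, which is an assumption about code we cannot see):

  spark-submit --master local[*] \
    --class retail_db.GetRevenuePerOrder \
    spark2practice_2.11-0.1.jar \
    local[*] /public/retail_db/order_items /home/rxie/output/revenueperorder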

how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Hello, I am wondering how I can run a Spark job in my environment, which is a single Ubuntu host with no Hadoop installed. If I run my job like below, I end up with an infinite loop at the end. Thank you very much.

  rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread vaquar khan
Totally agreed with Eyal. The problem is that when the Java programs Catalyst generates from DataFrame and Dataset code are compiled into Java bytecode, the bytecode of any single method must not reach 64 KB or more. The generated code conflicts with this limitation of the Java class file, which is
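A hedged workaround sketch (a commonly cited mitigation, not taken from this thread): disabling whole-stage code generation trades the 64 KB compilation failure for less optimized plans; spark.sql.codegen.wholeStage is an existing Spark SQL config:

  // inside a running session
  spark.conf.set("spark.sql.codegen.wholeStage", false)

  // or at submit time
  spark-submit --conf spark.sql.codegen.wholeStage=false ...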

Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Thank you Vamshi, yes the path presumably has been added; here it is:

  rxie@ubuntu:~/Downloads/spark$ echo $PATH
  /home/rxie/Downloads/spark

Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Vamshi Talla
Raymond, is your SPARK_HOME set? In your .bash_profile, try setting the below:

  export SPARK_HOME=/home/Downloads/spark

(or wherever your spark is downloaded to). Once done, source your .bash_profile or restart the shell and try spark-shell. Best Regards, Vamshi T
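A hedged concrete version of this for the directory Raymond shows elsewhere in the thread (the PATH line, which puts the distribution's bin/ scripts on the path, is my addition rather than part of Vamshi's message):

  export SPARK_HOME=/home/rxie/Downloads/spark
  export PATH=$SPARK_HOME/bin:$PATH

Then reload and retry:

  source ~/.bash_profile
  spark-shell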

Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Hello, it would be really appreciated if anyone could help me sort out the following path issue. I strongly suspect this is related to a missing path setting but don't know how I can fix it.

  rxie@ubuntu:~/Downloads/spark$ echo $PATH

spark-shell doesn't start

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice on Ubuntu now; here is the error I am encountering:

  rxie@ubuntu:~/Downloads/spark/bin$ spark-shell
  Error: Could not find or load main class org.apache.spark.launcher.Main

What am I missing? Thank you very much. Java is installed.
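A hedged diagnostic sketch (not from the thread): org.apache.spark.launcher.Main ships in the spark-launcher jar under the distribution's jars/ directory, so one quick check is whether that jar actually exists, e.g.:

  ls ~/Downloads/spark/jars | grep -i launcher

If jars/ is empty or missing, the download or extraction is likely incomplete.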

Re: spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Vamshi Talla
Hi Raymond, I see that you can make a small correction in your spark-submit command. Your spark-submit command should say:

  spark-submit --master local --class <package name>.<class name> <jar location and jar name>

Example:

  spark-submit --master local \
    --class retail_db.GetRevenuePerOrder

spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice on Windows now. I have the jar file generated under:

  C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar

The package name is Retail_db and the object is GetRevenuePerOrder. The spark-submit command is:

  spark-submit
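A hedged reconstruction for illustration only (the actual arguments are cut off above; the class name and jar path are the ones Raymond gives, and the caret line continuations are cmd.exe syntax):

  spark-submit --master local ^
    --class retail_db.GetRevenuePerOrder ^
    C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar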

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread Eyal Zituny
Hi Akash, such errors might appear in large Spark pipelines; the root cause is a 64 KB JVM limitation. The reason your job isn't failing in the end is Spark's fallback: if codegen fails, the Spark compiler will try to create the flow without codegen (less optimized). If you do not
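For reference, a hedged pointer to the knob behind the fallback Eyal describes (an internal Spark 2.x setting, default true; relying on it is not a public-API guarantee):

  // keeps jobs running by skipping codegen when compilation fails
  spark.conf.set("spark.sql.codegen.fallback", true)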