[Spark SQL] Can explode array of structs in correlated subquery

2018-06-17 Thread bobotu
I'm not sure how to describe this scenario in words, let's see some example
SQL.

Given the table schema:

create table customer (
   c_custkey  bigint,
   c_name     string,
   c_orders   array<struct<...>>
)

Now I want to know each customer's `avg(o_totalprice)`. Maybe I can use
`explode(c_orders)` to unpack this array, but that method needs an extra
`group by` (see the sketch below).
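
For example, something like this (just a sketch; the struct fields other than
`o_totalprice` are omitted since I don't show the full schema above):

val avgPerCustomer = spark.sql("""
  SELECT c_name, AVG(o.o_totalprice) AS average_price
  FROM customer
  LATERAL VIEW explode(c_orders) exploded AS o
  GROUP BY c_custkey, c_name
""")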
   
Impala has an interesting feature to support this:

SELECT c_name, average_price
FROM
   customer c,
   (SELECT AVG(o_totalprice) average_price FROM c.c_orders) subq1

I can use `map` to do the same thing, but the Impala way can make better use of
Parquet's projection push-down.
When I use `map`, the optimizer has no idea which fields I will use, so
Spark reads the whole struct.
The Impala way lets the optimizer know which fields I'm actually using, which
reduces the amount of data read.

Although [SPARK-4502] points out that Spark's optimizer doesn't use this Parquet
feature yet, it seems it will be fixed in the next major version.






Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its 
SparkContext. I see you are still passing “yarn-client” as an argument later to 
it in your command.
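
For illustration, a hypothetical sketch of what that can look like (this is a guess
at the shape of the retail_db code, not the actual source):

import org.apache.spark.{SparkConf, SparkContext}

object GetRevenuePerOrder {
  def main(args: Array[String]): Unit = {
    // If the master comes from args (or is hard-coded) here, it overrides the
    // --master flag given to spark-submit: values set directly on the SparkConf
    // take precedence over spark-submit flags.
    val conf = new SparkConf()
      .setAppName("Get Revenue Per Order")
      .setMaster(args(0))   // "yarn-client" in the command below
    val sc = new SparkContext(conf)

    // Dropping setMaster(...) here, or passing "local[*]" as the first argument,
    // would let "--master local[*]" from spark-submit take effect.
  }
}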

> On Jun 17, 2018, at 11:53 AM, Raymond Xie  wrote:
> 
> Thank you Subhash.
> 
> Here is the new command:
> spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf 
> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 
> Still seeing the same issue here.
> 2018-06-17 11:51:25 INFO  RMProxy:98 - Connecting to ResourceManager at /0.0.0.0:8032
> 2018-06-17 11:51:27 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:28 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:29 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:30 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:31 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:32 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:33 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:34 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:35 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:36 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 
> 
> 
> 
> Sincerely yours,
> 
> 
> Raymond
> 
> On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram  
> wrote:
> Hi Raymond,
> 
> If you set your master to local[*] instead of yarn-client, it should run on 
> your local machine.
> 
> Thanks,
> Subhash 
> 
> Sent from my iPhone
> 
> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
> 
>> Hello,
>> 
>> I am wondering how can I run spark job in my environment which is a single 
>> Ubuntu host with no hadoop installed? if I run my job like below, I will end 
>> up with infinite loop at the end. Thank you very much.
>> 
>> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf 
>> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
>> /public/retail_db/order_items /home/rxie/output/revenueperorder
>> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a 
>> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface 
>> ens33)
>> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP 

[Spark-sql Dataset] .as[SomeClass] not modifying Physical Plan

2018-06-17 Thread Daniel Pires
Hi everyone,

I am trying to understand the behaviour of .as[SomeClass] (Dataset API):

Say I have a file with Users:
case class User(id: Int, name: String, address: String, date_add: java.sql.Date)
val users = sc.parallelize(Stream.fill(100)(User(0, "test", "Test Street", new java.sql.Date(0, 0, 0)))).toDF()

Say this is a Parquet file in my data-lake and I created a case class (User) 
that defines all the fields in that file.

Now I want to work with only a subset of the fields for a given pipeline, and I 
want to take advantage of the Columnar format the file is saved as (reading a 
subset of columns); but I don’t want to lose the Dataset API.
So I create a sub-case class defining only the few fields I need:
case class UserSub(id: Int, name: String)

Now my first surprise is that when I run:
users.as[UserSub].show()

I get:
+---+----+-----------+----------+
| id|name|    address|  date_add|
+---+----+-----------+----------+
|  0|test|Test Street|1899-12-31|
|  0|test|Test Street|1899-12-31|
...

Which I understand as: If no operation is called after `as[UserSub]`, Spark 
will not serialise the rows so the extra fields I’m not interested in will not 
be dropped.
I can confirm it by running an explain:
users.as[UserSub].explain
== Physical Plan ==
*(1) SerializeFromObject [assertnotnull(input[0, $line34.$read$$iw$$iw$User, 
true]).id AS id#69, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS 
name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, 
true]).address, true, false) AS address#71, staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, 
false) AS date_add#72]
+- Scan ExternalRDDScan[obj#68]
Which shows that the execution plan was not modified.


Now if I run:
users.as[UserSub].map(x => x).show()
+---+----+
| id|name|
+---+----+
|  0|test|
|  0|test|
...

The fields were dropped (so I got the end-result that I wanted).
Now if I run an explain:
users.as[UserSub].map(x => x).explain
== Physical Plan ==
*(1) SerializeFromObject [assertnotnull(input[0, $line37.$read$$iw$$iw$UserSub, 
true]).id AS id#141, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, $line37.$read$$iw$$iw$UserSub, true]).name, true, false) 
AS name#142]
+- *(1) MapElements , obj#140: $line37.$read$$iw$$iw$UserSub
   +- *(1) DeserializeToObject newInstance(class 
$line37.$read$$iw$$iw$UserSub), obj#139: $line37.$read$$iw$$iw$UserSub
  +- *(1) Project [id#69, name#70]
 +- *(1) SerializeFromObject [assertnotnull(input[0, 
$line34.$read$$iw$$iw$User, true]).id AS id#69, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS 
name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, 
true]).address, true, false) AS address#71, staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, 
false) AS date_add#72]
+- Scan ExternalRDDScan[obj#68]


I see that in that case the physical plan changed,
the line I’m interested in is Project [id#69, name#70]
Which is selecting individual columns just like in the DataFrame API:
users.select("id", "name").explain()
== Physical Plan ==
*(1) Project [id#69, name#70]
+- *(1) SerializeFromObject [assertnotnull(input[0, $line34.$read$$iw$$iw$User, 
true]).id AS id#69, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS 
name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, 
true]).address, true, false) AS address#71, staticinvoke(class 
org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, 
assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, 
false) AS date_add#72]
   +- Scan ExternalRDDScan[obj#68]


Am I expecting the wrong thing from `.as[SomeClass]`? Why wouldn't it add a
Project to the query plan by default?
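
For reference, a workaround sketch that combines the two approaches above (same
`users` DataFrame and `UserSub` case class; `spark.implicits._` is assumed to be
in scope, as in spark-shell):

// Select only the needed columns first, then attach the typed view: the Project
// appears in the physical plan just as in the DataFrame example, and the result
// is still a Dataset[UserSub].
val userSubs = users.select("id", "name").as[UserSub]
userSubs.explain()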

Best regards,

Daniel Mateus Pires

Data Engineer




making query state checkpoint compatible in structured streaming

2018-06-17 Thread puneetloya
Consider a Spark query (A) which depends on Kafka topics t1 and t2.

After running this query in streaming mode, a checkpoint directory (C1) gets
created for the query, with offsets and sources subdirectories. Now I add a
third topic (t3) on which the query depends, as sketched below.
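
A minimal sketch of the change I mean (broker address, topic names, and paths
are placeholders):

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "t1,t2,t3")   // was "t1,t2" when checkpoint C1 was written
  .load()

// the query is then restarted with the same checkpoint directory:
// .writeStream.option("checkpointLocation", "/path/to/C1") ...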

Now if I restart Spark with the same checkpoint C1, Spark crashes as expected,
because it cannot find an entry for the third topic (t3).

As a hack, I tried to manually add topic t3 to the sources and offsets
directories of the query's checkpoint, but Spark still crashed.

What's the correct way to solve this problem? How should such upgrade paths be
handled in Structured Streaming?






Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Thank you Subhash.

Here is the new command:
spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf
spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client
/public/retail_db/order_items /home/rxie/output/revenueperorder

Still seeing the same issue here.
2018-06-17 11:51:25 INFO  RMProxy:98 - Connecting to ResourceManager at /0.0.0.0:8032
2018-06-17 11:51:27 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:28 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:29 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:30 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:31 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:32 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:33 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:34 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:35 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:36 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)



Sincerely yours,

Raymond

On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram 
wrote:

> Hi Raymond,
>
> If you set your master to local[*] instead of yarn-client, it should run
> on your local machine.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
>
> Hello,
>
> I am wondering how can I run spark job in my environment which is a single
> Ubuntu host with no hadoop installed? if I run my job like below, I will
> end up with infinite loop at the end. Thank you very much.
>
> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder
> --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a
> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface
> ens33)
> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to
> bind to another address
> 2018-06-17 11:19:37 WARN  NativeCodeLoader:62 - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Running Spark version 2.3.1
> 2018-06-17 11:19:38 WARN  SparkConf:66 - spark.master yarn-client is
> deprecated in Spark 2.0+, please instead use "yarn" with specified deploy
> mode.
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Submitted application: Get
> Revenue Per Order
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls to: rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls to:
> rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls groups
> to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls groups
> to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - SecurityManager:
> authentication disabled; ui acls disabled; users  with view permissions:
> Set(rxie); groups with view permissions: Set(); users  with modify
> permissions: Set(rxie); groups with modify permissions: Set()
> 2018-06-17 11:19:39 INFO  Utils:54 - Successfully started service
> 'sparkDriver' on port 44709.
> 2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering MapOutputTracker
> 2018-06-17 11:19:39 INFO  

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Subhash Sriram
Hi Raymond,

If you set your master to local[*] instead of yarn-client, it should run on 
your local machine.

Thanks,
Subhash 

Sent from my iPhone

> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
> 
> Hello,
> 
> I am wondering how can I run spark job in my environment which is a single 
> Ubuntu host with no hadoop installed? if I run my job like below, I will end 
> up with infinite loop at the end. Thank you very much.
> 
> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf 
> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a 
> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface 
> ens33)
> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind 
> to another address
> 2018-06-17 11:19:37 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Running Spark version 2.3.1
> 2018-06-17 11:19:38 WARN  SparkConf:66 - spark.master yarn-client is 
> deprecated in Spark 2.0+, please instead use "yarn" with specified deploy 
> mode.
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Submitted application: Get 
> Revenue Per Order
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls to: rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls to: rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls groups to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls groups to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - SecurityManager: 
> authentication disabled; ui acls disabled; users  with view permissions: 
> Set(rxie); groups with view permissions: Set(); users  with modify 
> permissions: Set(rxie); groups with modify permissions: Set()
> 2018-06-17 11:19:39 INFO  Utils:54 - Successfully started service 
> 'sparkDriver' on port 44709.
> 2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering MapOutputTracker
> 2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering BlockManagerMaster
> 2018-06-17 11:19:39 INFO  BlockManagerMasterEndpoint:54 - Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 2018-06-17 11:19:39 INFO  BlockManagerMasterEndpoint:54 - 
> BlockManagerMasterEndpoint up
> 2018-06-17 11:19:39 INFO  DiskBlockManager:54 - Created local directory at 
> /tmp/blockmgr-69a8a12d-0881-4454-96ab-6a45d5c58bfe
> 2018-06-17 11:19:39 INFO  MemoryStore:54 - MemoryStore started with capacity 
> 413.9 MB
> 2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
> 2018-06-17 11:19:40 INFO  log:192 - Logging initialized @7035ms
> 2018-06-17 11:19:40 INFO  Server:346 - jetty-9.3.z-SNAPSHOT
> 2018-06-17 11:19:40 INFO  Server:414 - Started @7383ms
> 2018-06-17 11:19:40 INFO  AbstractConnector:278 - Started 
> ServerConnector@51ad75c2{HTTP/1.1,[http/1.1]}{0.0.0.0:12678}
> 2018-06-17 11:19:40 INFO  Utils:54 - Successfully started service 'SparkUI' 
> on port 12678.
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@50b8ae8d{/jobs,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@60afd40d{/jobs/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@28a2a3e7{/jobs/job,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@10b3df93{/jobs/job/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@ea27e34{/stages,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@33a2499c{/stages/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@e72dba7{/stages/stage,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@3c321bdb{/stages/stage/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@24855019{/stages/pool,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@3abd581e{/stages/pool/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@4d4d8fcf{/storage,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@610db97e{/storage/json,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@6f0628de{/storage/rdd,null,AVAILABLE,@Spark}
> 2018-06-17 11:19:40 INFO  ContextHandler:781 - Started 
> 

how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Hello,

I am wondering how I can run a Spark job in my environment, which is a single
Ubuntu host with no Hadoop installed. If I run my job like below, I end up with
an infinite loop at the end. Thank you very much.

rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder
--conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client
/public/retail_db/order_items /home/rxie/output/revenueperorder
2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a
loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface
ens33)
2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind
to another address
2018-06-17 11:19:37 WARN  NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-06-17 11:19:38 INFO  SparkContext:54 - Running Spark version 2.3.1
2018-06-17 11:19:38 WARN  SparkConf:66 - spark.master yarn-client is
deprecated in Spark 2.0+, please instead use "yarn" with specified deploy
mode.
2018-06-17 11:19:38 INFO  SparkContext:54 - Submitted application: Get
Revenue Per Order
2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls to: rxie
2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls to: rxie
2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls groups to:
2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls groups
to:
2018-06-17 11:19:38 INFO  SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users  with view permissions:
Set(rxie); groups with view permissions: Set(); users  with modify
permissions: Set(rxie); groups with modify permissions: Set()
2018-06-17 11:19:39 INFO  Utils:54 - Successfully started service
'sparkDriver' on port 44709.
2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-06-17 11:19:39 INFO  BlockManagerMasterEndpoint:54 - Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology
information
2018-06-17 11:19:39 INFO  BlockManagerMasterEndpoint:54 -
BlockManagerMasterEndpoint up
2018-06-17 11:19:39 INFO  DiskBlockManager:54 - Created local directory at
/tmp/blockmgr-69a8a12d-0881-4454-96ab-6a45d5c58bfe
2018-06-17 11:19:39 INFO  MemoryStore:54 - MemoryStore started with
capacity 413.9 MB
2018-06-17 11:19:39 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-06-17 11:19:40 INFO  log:192 - Logging initialized @7035ms
2018-06-17 11:19:40 INFO  Server:346 - jetty-9.3.z-SNAPSHOT
2018-06-17 11:19:40 INFO  Server:414 - Started @7383ms
2018-06-17 11:19:40 INFO  AbstractConnector:278 - Started
ServerConnector@51ad75c2{HTTP/1.1,[http/1.1]}{0.0.0.0:12678}
2018-06-17 11:19:40 INFO  Utils:54 - Successfully started service 'SparkUI'
on port 12678.
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@50b8ae8d{/jobs,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@60afd40d{/jobs/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@28a2a3e7{/jobs/job,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@10b3df93{/jobs/job/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@ea27e34{/stages,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@33a2499c{/stages/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@e72dba7{/stages/stage,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@3c321bdb
{/stages/stage/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@24855019{/stages/pool,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@3abd581e
{/stages/pool/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@4d4d8fcf{/storage,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@610db97e{/storage/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@6f0628de{/storage/rdd,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@3fabf088
{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@1e392345{/environment,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@12f3afb5
{/environment/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread vaquar khan
Totally agree with Eyal.

The problem is that when the Java code Catalyst generates from DataFrame and
Dataset programs is compiled into Java bytecode, the bytecode of a single method
must not reach 64 KB. Going past that hits a limitation of the Java class file
format, which is the exception that occurs.

To avoid exceptions caused by this restriction, Spark's approach is to have
Catalyst split generated methods that are likely to exceed 64 KB of bytecode
into multiple smaller methods.

Use persist or some other logical separation in the pipeline (see the sketch
below).
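
A rough sketch of both mitigations (shown in Scala for illustration; the thread's
job is PySpark, where the same config applies; `featuresSoFar` and
`addMoreFeatures` are hypothetical stand-ins for pipeline steps):

// Option 1: disable whole-stage code generation, so the large fused methods that
// exceed 64 KB are not generated (plans are less optimized, but the error goes away).
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Option 2: materialize an intermediate result between groups of transformations,
// giving code generation a boundary so later stages compile against the cached
// data instead of the whole accumulated plan.
val intermediate = featuresSoFar.persist()
intermediate.count()                      // forces materialization of the cache
val next = addMoreFeatures(intermediate)  // subsequent steps start from the cached data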

Regards,
Vaquar khan

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  wrote:

> Hi Akash,
> such errors might appear in large spark pipelines, the root cause is a
> 64kb jvm limitation.
> the reason that your job isn't failing at the end is due to spark fallback
> - if code gen is failing, spark compiler will try to create the flow
> without the code gen (less optimized)
> if you do not want to see this error, you can either disable code gen
> using the flag:  spark.sql.codegen.wholeStage= "false"
> or you can try to split your complex pipeline into several spark flows if
> possible
>
> hope that helps
>
> Eyal
>
> On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I already went through it, that's one use case. I've a complex and very
>> big pipeline of multiple jobs under one spark session. Not getting, on how
>> to solve this, as it is happening over Logistic Regression and Random
>> Forest models, which I'm just using from Spark ML package rather than doing
>> anything by myself.
>>
>> Thanks,
>> Aakash.
>>
>> On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
>>
>>> Hi Akash,
>>>
>>> Please check stackoverflow.
>>>
>>> https://stackoverflow.com/questions/41098953/codegen-grows-
>>> beyond-64-kb-error-when-normalizing-large-pyspark-dataframe
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu >> > wrote:
>>>
 Hi guys,

 I'm getting an error when I'm feature engineering on 30+ columns to
 create about 200+ columns. It is not failing the job, but the ERROR shows.
 I want to know how can I avoid this.

 Spark - 2.3.1
 Python - 3.6

 Cluster Config -
 1 Master - 32 GB RAM, 16 Cores
 4 Slaves - 16 GB RAM, 8 Cores


 Input data - 8 partitions of parquet file with snappy compression.

 My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5
 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf
 spark.driver.maxResultSize=2G --conf 
 "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
 --conf spark.scheduler.listenerbus.eventqueue.capacity=2 --conf
 spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py
 > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_
 33_col.txt

 Stack-Trace below -

 ERROR CodeGenerator:91 - failed to compile:
> org.codehaus.janino.InternalCompilerException: Compiling
> "GeneratedClass": Code of method "processNext()V" of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge
> neratedIteratorForCodegenStage3426" grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling
> "GeneratedClass": Code of method "processNext()V" of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge
> neratedIteratorForCodegenStage3426" grows beyond 64 KB
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.
> java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:
> 234)
> at org.codehaus.janino.SimpleCompiler.compileToClassLoader(Simp
> leCompiler.java:446)
> at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassB
> odyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluat
> or.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:
> 204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
> tor$.org$apache$spark$sql$catalyst$expressions$codegen$C
> odeGenerator$$doCompile(CodeGenerator.scala:1417)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
> tor$$anon$1.load(CodeGenerator.scala:1493)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
> tor$$anon$1.load(CodeGenerator.scala:1490)
> at org.spark_project.guava.cache.LocalCache$LoadingValueReferen
> ce.loadFuture(LocalCache.java:3599)
> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(Lo
> calCache.java:2379)
> at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOr
> 

Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Thank you Vamshi,

Yes the path presumably has been added, here it is:

rxie@ubuntu:~/Downloads/spark$ echo $PATH
/home/rxie/Downloads/spark:/usr/bin/java:/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
rxie@ubuntu:~/Downloads/spark$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main



Sincerely yours,

Raymond

On Sun, Jun 17, 2018 at 8:44 AM, Vamshi Talla  wrote:

> Raymond,
>
>
> Is your SPARK_HOME set? In your .bash_profile, try setting the below:
>
> export SPARK_HOME=/home/Downloads/spark (or wherever your spark is
> downloaded to)
>
> once done, source your .bash_profile or restart the shell and try
> spark-shell
>
>
> Best Regards,
>
> Vamshi T
>
>
> --
> *From:* Raymond Xie 
> *Sent:* Sunday, June 17, 2018 6:27 AM
> *To:* user; Hui Xie
> *Subject:* Error: Could not find or load main class
> org.apache.spark.launcher.Main
>
> Hello,
>
> It would be really appreciated if anyone can help sort it out the
> following path issue for me? I highly doubt this is related to missing path
> setting but don't know how can I fix it.
>
>
>
> rxie@ubuntu:~/Downloads/spark$ echo $PATH
> /usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
> rxie@ubuntu:~/Downloads/spark$ pyspark
> Error: Could not find or load main class org.apache.spark.launcher.Main
> rxie@ubuntu:~/Downloads/spark$ spark-shell
> Error: Could not find or load main class org.apache.spark.launcher.Main
> rxie@ubuntu:~/Downloads/spark$ pwd
> /home/rxie/Downloads/spark
> rxie@ubuntu:~/Downloads/spark$ ls
> bin  conf  data  examples  jars  kubernetes  licenses  R  yarn
>
>
> Sincerely yours,
>
> Raymond
>


Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Vamshi Talla
Raymond,


Is your SPARK_HOME set? In your .bash_profile, try setting the below:

export SPARK_HOME=/home/Downloads/spark (or wherever your spark is downloaded 
to)

once done, source your .bash_profile or restart the shell and try spark-shell


Best Regards,

Vamshi T



From: Raymond Xie 
Sent: Sunday, June 17, 2018 6:27 AM
To: user; Hui Xie
Subject: Error: Could not find or load main class org.apache.spark.launcher.Main

Hello,

It would be really appreciated if anyone can help sort it out the following 
path issue for me? I highly doubt this is related to missing path setting but 
don't know how can I fix it.



rxie@ubuntu:~/Downloads/spark$ echo $PATH
/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
rxie@ubuntu:~/Downloads/spark$ pyspark
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ pwd
/home/rxie/Downloads/spark
rxie@ubuntu:~/Downloads/spark$ ls
bin  conf  data  examples  jars  kubernetes  licenses  R  yarn



Sincerely yours,


Raymond


Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Hello,

It would be really appreciated if anyone could help me sort out the following
path issue. I highly doubt this is related to a missing path setting, but I
don't know how I can fix it.



rxie@ubuntu:~/Downloads/spark$ echo $PATH
/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
rxie@ubuntu:~/Downloads/spark$ pyspark
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ pwd
/home/rxie/Downloads/spark
rxie@ubuntu:~/Downloads/spark$ ls
bin  conf  data  examples  jars  kubernetes  licenses  R  yarn


Sincerely yours,

Raymond


spark-shell doesn't start

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice in Ubuntu now, here is the error I am
encountering:


rxie@ubuntu:~/Downloads/spark/bin$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main


What am I missing?

Thank you very much.

Java is installed.

Sincerely yours,

Raymond


Re: spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Vamshi Talla
Hi Raymond,


I see that you can make a small correction in your spark-submit command. Your 
spark-submit command should say:


spark-submit --master local --class <package name>.<class name> <jar location and jar name> <arguments>


Example:


spark-submit --master local \
  --class retail_db.GetRevenuePerOrder \
  C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar \
  C:\RXIE\Learning\Data\data-master\retail_db\order_items \
  C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order



Also, I believe you can use "\" to split the command across multiple lines even on
Windows.


Best Regards,
Vamshi T


From: Raymond Xie 
Sent: Sunday, June 17, 2018 5:07 AM
To: user; Hui Xie
Subject: spark-submit Error: Cannot load main class from JAR file

Hello, I am doing the practice in windows now.
I have the jar file generated under:
C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar

The package name is Retail_db and the object is GetRevenuePerOrder.

The spark-submit command is:

spark-submit retail_db.GetRevenuePerOrder --class 
"C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar"
 local   C:\RXIE\Learning\Data\data-master\retail_db\order_items 
C:\C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order

When executing this command, it throws out the error:
Error: Cannot load main class from JAR file

What is the right submit command should I write here?


Thank you very much.

By the way, the command is really too long, how can I split it into multiple 
lines like '\' in Linux?





Sincerely yours,


Raymond


spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice in windows now.
I have the jar file generated under:
C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar

The package name is Retail_db and the object is GetRevenuePerOrder.

The spark-submit command is:

spark-submit retail_db.GetRevenuePerOrder --class "C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar" local C:\RXIE\Learning\Data\data-master\retail_db\order_items C:\C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order

When executing this command, it throws out the error:
Error: Cannot load main class from JAR file

What is the right submit command I should write here?


Thank you very much.

By the way, the command is really too long, how can I split it into
multiple lines like '\' in Linux?




Sincerely yours,

Raymond


Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread Eyal Zituny
Hi Akash,
Such errors can appear in large Spark pipelines; the root cause is the 64 KB JVM
method-size limitation.
The reason your job isn't failing in the end is Spark's fallback: if code
generation fails, the Spark compiler will try to build the flow without code gen
(less optimized).
If you do not want to see this error, you can either disable code gen using the
flag spark.sql.codegen.wholeStage = "false",
or you can try to split your complex pipeline into several Spark flows, if
possible (a rough sketch below).
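
A rough sketch of the "several Spark flows" idea (the path and the `firstPart`
DataFrame are placeholders, not from the original job):

// Materialize the output of the first part of the pipeline, then start the
// second part from that data, so each part gets its own, smaller generated code.
firstPart.write.mode("overwrite").parquet("/tmp/pipeline_intermediate")
val secondPartInput = spark.read.parquet("/tmp/pipeline_intermediate")
// ...continue the remaining feature engineering / model fitting from secondPartInput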

hope that helps

Eyal

On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu 
wrote:

> Hi,
>
> I already went through it, that's one use case. I've a complex and very
> big pipeline of multiple jobs under one spark session. Not getting, on how
> to solve this, as it is happening over Logistic Regression and Random
> Forest models, which I'm just using from Spark ML package rather than doing
> anything by myself.
>
> Thanks,
> Aakash.
>
> On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
>
>> Hi Akash,
>>
>> Please check stackoverflow.
>>
>> https://stackoverflow.com/questions/41098953/codegen-
>> grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe
>>
>> Regards,
>> Vaquar khan
>>
>> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu 
>> wrote:
>>
>>> Hi guys,
>>>
>>> I'm getting an error when I'm feature engineering on 30+ columns to
>>> create about 200+ columns. It is not failing the job, but the ERROR shows.
>>> I want to know how can I avoid this.
>>>
>>> Spark - 2.3.1
>>> Python - 3.6
>>>
>>> Cluster Config -
>>> 1 Master - 32 GB RAM, 16 Cores
>>> 4 Slaves - 16 GB RAM, 8 Cores
>>>
>>>
>>> Input data - 8 partitions of parquet file with snappy compression.
>>>
>>> My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
>>> --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5
>>> --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf
>>> spark.driver.maxResultSize=2G --conf "spark.executor.
>>> extraJavaOptions=-XX:+UseParallelGC" --conf 
>>> spark.scheduler.listenerbus.eventqueue.capacity=2
>>> --conf spark.sql.codegen=true 
>>> /appdata/bblite-codebase/pipeline_data_test_run.py
>>> > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt
>>>
>>> Stack-Trace below -
>>>
>>> ERROR CodeGenerator:91 - failed to compile: 
>>> org.codehaus.janino.InternalCompilerException:
 Compiling "GeneratedClass": Code of method "processNext()V" of class
 "org.apache.spark.sql.catalyst.expressions.GeneratedClass$
 GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
 org.codehaus.janino.InternalCompilerException: Compiling
 "GeneratedClass": Code of method "processNext()V" of class
 "org.apache.spark.sql.catalyst.expressions.GeneratedClass$
 GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
 at org.codehaus.janino.UnitCompiler.compileUnit(
 UnitCompiler.java:361)
 at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
 at org.codehaus.janino.SimpleCompiler.compileToClassLoader(
 SimpleCompiler.java:446)
 at org.codehaus.janino.ClassBodyEvaluator.compileToClass(
 ClassBodyEvaluator.java:313)
 at org.codehaus.janino.ClassBodyEvaluator.cook(
 ClassBodyEvaluator.java:235)
 at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
 at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
 at org.apache.spark.sql.catalyst.expressions.codegen.
 CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$
 CodeGenerator$$doCompile(CodeGenerator.scala:1417)
 at org.apache.spark.sql.catalyst.expressions.codegen.
 CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
 at org.apache.spark.sql.catalyst.expressions.codegen.
 CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
 at org.spark_project.guava.cache.LocalCache$LoadingValueReference.
 loadFuture(LocalCache.java:3599)
 at org.spark_project.guava.cache.LocalCache$Segment.loadSync(
 LocalCache.java:2379)
 at org.spark_project.guava.cache.LocalCache$Segment.
 lockedGetOrLoad(LocalCache.java:2342)
 at org.spark_project.guava.cache.LocalCache$Segment.get(
 LocalCache.java:2257)
 at org.spark_project.guava.cache.LocalCache.get(LocalCache.
 java:4000)
 at org.spark_project.guava.cache.LocalCache.getOrLoad(
 LocalCache.java:4004)
 at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.
 get(LocalCache.java:4874)
 at org.apache.spark.sql.catalyst.expressions.codegen.
 CodeGenerator$.compile(CodeGenerator.scala:1365)
 at org.apache.spark.sql.execution.WholeStageCodegenExec.
 liftedTree1$1(WholeStageCodegenExec.scala:579)
 at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(
 WholeStageCodegenExec.scala:578)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$
 execute$1.apply(SparkPlan.scala:131)