[Spark SQL] Can explode array of structs in correlated subquery
I'm not sure how to describe this scenario in words, so let's look at some example SQL. Given the table schema: create table customer ( c_custkey bigint, c_name string, c_orders array<struct<...>> ) Now I want to know each customer's `avg(o_totalprice)`. Maybe I can use `explode(c_orders)` to unpack this array, but this method needs an extra `group by`. Impala has an interesting feature to support this: SELECT c_name, average_price FROM customer c, (SELECT AVG(o_totalprice) average_price FROM c.c_orders) subq1 I can use `map` to do the same thing, but the Impala way can make better use of Parquet's projection push-down. When I use `map`, the optimizer has no idea which fields I will use, so Spark reads the whole struct. The Impala way lets the optimizer know which fields I'm actually using, which reduces the read size. Although [SPARK-4502] points out that Spark's optimizer doesn't use this Parquet feature yet, it seems it will be fixed in the next major version.
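For reference, here is a minimal Scala sketch of the explode-plus-group-by workaround mentioned above; any struct fields other than o_totalprice are assumptions:

    import org.apache.spark.sql.functions.{avg, col, explode}

    // Unpack the orders array into one row per order, then re-aggregate per customer.
    val avgPerCustomer = spark.table("customer")
      .select(col("c_custkey"), col("c_name"), explode(col("c_orders")).as("o"))
      .groupBy("c_custkey", "c_name")
      .agg(avg(col("o.o_totalprice")).as("average_price"))

This is exactly the extra group by the post wants to avoid; the Impala-style correlated FROM clause expresses the same aggregation without it.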
Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed
Maybe your application is overriding the master variable when it creates its SparkContext. I see you are still passing “yarn-client” as an argument later to it in your command. > On Jun 17, 2018, at 11:53 AM, Raymond Xie wrote: > > Thank you Subhash. > > Here is the new command: > spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf > spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client > /public/retail_db/order_items /home/rxie/output/revenueperorder > > Still seeing the same issue here. > 2018-06-17 11:51:25 INFO RMProxy:98 - Connecting to ResourceManager at > /0.0.0.0:8032 > 2018-06-17 11:51:27 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:28 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:29 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:30 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:31 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:32 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:33 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:34 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:35 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > 2018-06-17 11:51:36 INFO Client:871 - Retrying connect to server: > 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is > >RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) > > > > > Sincerely yours, > > > Raymond > > On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram > wrote: > Hi Raymond, > > If you set your master to local[*] instead of yarn-client, it should run on > your local machine. > > Thanks, > Subhash > > Sent from my iPhone > > On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > >> Hello, >> >> I am wondering how can I run spark job in my environment which is a single >> Ubuntu host with no hadoop installed? if I run my job like below, I will end >> up with infinite loop at the end. Thank you very much. 
>> >> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf >> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client >> /public/retail_db/order_items /home/rxie/output/revenueperorder >> 2018-06-17 11:19:36 WARN Utils:66 - Your hostname, ubuntu resolves to a >> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface >> ens33) >> 2018-06-17 11:19:36 WARN Utils:66 - Set SPARK_LOCAL_IP
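For illustration, a minimal sketch of what the reply suggests: build the session without a hardcoded master so the --master local[*] passed to spark-submit takes effect. The class name is reused from the thread; the body is hypothetical, not the poster's actual code.

    import org.apache.spark.sql.SparkSession

    object GetRevenuePerOrder {
      def main(args: Array[String]): Unit = {
        // No .master(...) here: the master comes from spark-submit,
        // so "yarn-client" should also be dropped from the program arguments.
        val spark = SparkSession.builder()
          .appName("Get Revenue Per Order")
          .getOrCreate()

        val Array(inputPath, outputPath) = args.take(2)
        // ... compute revenue per order from inputPath and write it to outputPath ...

        spark.stop()
      }
    }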
[Spark-sql Dataset] .as[SomeClass] not modifying Physical Plan
Hi everyone, I am trying to understand the behaviour of .as[SomeClass] (Dataset API): Say I have a file with Users: case class User(id: Int, name: String, address: String, date_add: java.sql.Date) val users = sc.parallelize(Stream.fill(100)(User(0, "test", "Test Street", new java.sql.Date(0, 0, 0)))).toDF() Say this is a Parquet file in my data-lake and I created a case class (User) that defines all the fields in that file. Now I want to work with only a subset of the fields for a given pipeline, and I want to take advantage of the columnar format the file is saved as (reading a subset of columns); but I don't want to lose the Dataset API. So I create a sub-case class defining only the few fields I need: case class UserSub(id: Int, name: String) Now my first surprise is that when I run: users.as[UserSub].show() I get: +---+----+-----------+----------+ | id|name| address| date_add| +---+----+-----------+----------+ | 0|test|Test Street|1899-12-31| | 0|test|Test Street|1899-12-31| ... Which I understand as: if no operation is called after `as[UserSub]`, Spark will not serialise the rows, so the extra fields I'm not interested in will not be dropped. I can confirm it by running an explain: users.as[UserSub].explain == Physical Plan == *(1) SerializeFromObject [assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).id AS id#69, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).address, true, false) AS address#71, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, false) AS date_add#72] +- Scan ExternalRDDScan[obj#68] Which shows that the execution plan was not modified. Now if I run: users.as[UserSub].map(x => x).show() +---+----+ | id|name| +---+----+ | 0|test| | 0|test| ... The fields were dropped (so I got the end result that I wanted).
Now if I run an explain: users.as[UserSub].map(x => x).explain == Physical Plan == *(1) SerializeFromObject [assertnotnull(input[0, $line37.$read$$iw$$iw$UserSub, true]).id AS id#141, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line37.$read$$iw$$iw$UserSub, true]).name, true, false) AS name#142] +- *(1) MapElements , obj#140: $line37.$read$$iw$$iw$UserSub +- *(1) DeserializeToObject newInstance(class $line37.$read$$iw$$iw$UserSub), obj#139: $line37.$read$$iw$$iw$UserSub +- *(1) Project [id#69, name#70] +- *(1) SerializeFromObject [assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).id AS id#69, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).address, true, false) AS address#71, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, false) AS date_add#72] +- Scan ExternalRDDScan[obj#68] I see that in that case the physical plan changed, the line I’m interested in is Project [id#69, name#70] Which is selecting individual columns just like in the DataFrame API: users.select("id", "name").explain() == Physical Plan == *(1) Project [id#69, name#70] +- *(1) SerializeFromObject [assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).id AS id#69, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).name, true, false) AS name#70, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).address, true, false) AS address#71, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, assertnotnull(input[0, $line34.$read$$iw$$iw$User, true]).date_add, true, false) AS date_add#72] +- Scan ExternalRDDScan[obj#68] Am I expecting the wrong things from `.as[SomeClass]` ? Why wouldn’t it by default add a Project to the Query Plan ? Best regards, Daniel Mateus Pires Data Engineer
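One way to get the pruning without the no-op map, assuming this matches what you are after, is to project explicitly before attaching the typed view; this yields the same Project [id, name] node shown above:

    import org.apache.spark.sql.Dataset
    import spark.implicits._   // encoder for UserSub (already in scope in spark-shell)

    // Select only the columns UserSub needs, then convert to a typed Dataset.
    val userSubs: Dataset[UserSub] = users.select("id", "name").as[UserSub]

    userSubs.explain()   // the plan should now contain Project [id#.., name#..]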
making query state checkpoint compatible in structured streaming
Consider a Spark query (A) that depends on Kafka topics t1 and t2. After running this query in streaming mode, a checkpoint (C1) directory for the query gets created with offsets and sources directories. Now I add a third topic (t3) on which the query depends. If I restart Spark with the same checkpoint C1, Spark crashes as expected, as it cannot find an entry for the third topic (t3). So, as a hack, I tried to manually add topic t3 to the sources and offsets directories of the query's checkpoint, but Spark still crashed. What's the correct way to solve this problem? How should such upgrade paths be handled in Structured Streaming?
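For context, a hedged sketch of the setup being described; the broker address and the file sink are placeholders, only the subscribe list and the checkpoint path matter here:

    // Query A, originally subscribed to t1 and t2.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
      .option("subscribe", "t1,t2")                       // later changed to "t1,t2,t3"
      .load()

    stream.writeStream
      .format("parquet")                                  // sink type is an assumption
      .option("path", "/data/out")
      .option("checkpointLocation", "/checkpoints/C1")    // the checkpoint the post calls C1
      .start()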
Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed
Thank you Subhash. Here is the new command: spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client /public/retail_db/order_items /home/rxie/output/revenueperorder Still seeing the same issue here. 2018-06-17 11:51:25 INFO RMProxy:98 - Connecting to ResourceManager at / 0.0.0.0:8032 2018-06-17 11:51:27 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:28 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:29 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:30 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:31 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:32 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:33 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:34 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:35 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-06-17 11:51:36 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) ** *Sincerely yours,* *Raymond* On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram wrote: > Hi Raymond, > > If you set your master to local[*] instead of yarn-client, it should run > on your local machine. > > Thanks, > Subhash > > Sent from my iPhone > > On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > > Hello, > > I am wondering how can I run spark job in my environment which is a single > Ubuntu host with no hadoop installed? if I run my job like below, I will > end up with infinite loop at the end. Thank you very much. > > rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder > --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client > /public/retail_db/order_items /home/rxie/output/revenueperorder > 2018-06-17 11:19:36 WARN Utils:66 - Your hostname, ubuntu resolves to a > loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface > ens33) > 2018-06-17 11:19:36 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to > bind to another address > 2018-06-17 11:19:37 WARN NativeCodeLoader:62 - Unable to load > native-hadoop library for your platform... 
using builtin-java classes where > applicable > 2018-06-17 11:19:38 INFO SparkContext:54 - Running Spark version 2.3.1 > 2018-06-17 11:19:38 WARN SparkConf:66 - spark.master yarn-client is > deprecated in Spark 2.0+, please instead use "yarn" with specified deploy > mode. > 2018-06-17 11:19:38 INFO SparkContext:54 - Submitted application: Get > Revenue Per Order > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls to: rxie > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls to: > rxie > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls groups > to: > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls groups > to: > 2018-06-17 11:19:38 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(rxie); groups with view permissions: Set(); users with modify > permissions: Set(rxie); groups with modify permissions: Set() > 2018-06-17 11:19:39 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 44709. > 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering MapOutputTracker > 2018-06-17 11:19:39 INFO
Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed
Hi Raymond, If you set your master to local[*] instead of yarn-client, it should run on your local machine. Thanks, Subhash Sent from my iPhone > On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > > Hello, > > I am wondering how can I run spark job in my environment which is a single > Ubuntu host with no hadoop installed? if I run my job like below, I will end > up with infinite loop at the end. Thank you very much. > > rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf > spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client > /public/retail_db/order_items /home/rxie/output/revenueperorder > 2018-06-17 11:19:36 WARN Utils:66 - Your hostname, ubuntu resolves to a > loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface > ens33) > 2018-06-17 11:19:36 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind > to another address > 2018-06-17 11:19:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-06-17 11:19:38 INFO SparkContext:54 - Running Spark version 2.3.1 > 2018-06-17 11:19:38 WARN SparkConf:66 - spark.master yarn-client is > deprecated in Spark 2.0+, please instead use "yarn" with specified deploy > mode. > 2018-06-17 11:19:38 INFO SparkContext:54 - Submitted application: Get > Revenue Per Order > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls to: rxie > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls to: rxie > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls groups to: > 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls groups to: > 2018-06-17 11:19:38 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(rxie); groups with view permissions: Set(); users with modify > permissions: Set(rxie); groups with modify permissions: Set() > 2018-06-17 11:19:39 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 44709. > 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering MapOutputTracker > 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering BlockManagerMaster > 2018-06-17 11:19:39 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2018-06-17 11:19:39 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2018-06-17 11:19:39 INFO DiskBlockManager:54 - Created local directory at > /tmp/blockmgr-69a8a12d-0881-4454-96ab-6a45d5c58bfe > 2018-06-17 11:19:39 INFO MemoryStore:54 - MemoryStore started with capacity > 413.9 MB > 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2018-06-17 11:19:40 INFO log:192 - Logging initialized @7035ms > 2018-06-17 11:19:40 INFO Server:346 - jetty-9.3.z-SNAPSHOT > 2018-06-17 11:19:40 INFO Server:414 - Started @7383ms > 2018-06-17 11:19:40 INFO AbstractConnector:278 - Started > ServerConnector@51ad75c2{HTTP/1.1,[http/1.1]}{0.0.0.0:12678} > 2018-06-17 11:19:40 INFO Utils:54 - Successfully started service 'SparkUI' > on port 12678. 
> 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@50b8ae8d{/jobs,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@60afd40d{/jobs/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@28a2a3e7{/jobs/job,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@10b3df93{/jobs/job/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@ea27e34{/stages,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@33a2499c{/stages/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@e72dba7{/stages/stage,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@3c321bdb{/stages/stage/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@24855019{/stages/pool,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@3abd581e{/stages/pool/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@4d4d8fcf{/storage,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@610db97e{/storage/json,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@6f0628de{/storage/rdd,null,AVAILABLE,@Spark} > 2018-06-17 11:19:40 INFO ContextHandler:781 - Started >
how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed
Hello, I am wondering how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed? if I run my job like below, I will end up with infinite loop at the end. Thank you very much. rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client /public/retail_db/order_items /home/rxie/output/revenueperorder 2018-06-17 11:19:36 WARN Utils:66 - Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface ens33) 2018-06-17 11:19:36 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address 2018-06-17 11:19:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-06-17 11:19:38 INFO SparkContext:54 - Running Spark version 2.3.1 2018-06-17 11:19:38 WARN SparkConf:66 - spark.master yarn-client is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode. 2018-06-17 11:19:38 INFO SparkContext:54 - Submitted application: Get Revenue Per Order 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls to: rxie 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls to: rxie 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing view acls groups to: 2018-06-17 11:19:38 INFO SecurityManager:54 - Changing modify acls groups to: 2018-06-17 11:19:38 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rxie); groups with view permissions: Set(); users with modify permissions: Set(rxie); groups with modify permissions: Set() 2018-06-17 11:19:39 INFO Utils:54 - Successfully started service 'sparkDriver' on port 44709. 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering MapOutputTracker 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering BlockManagerMaster 2018-06-17 11:19:39 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2018-06-17 11:19:39 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up 2018-06-17 11:19:39 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-69a8a12d-0881-4454-96ab-6a45d5c58bfe 2018-06-17 11:19:39 INFO MemoryStore:54 - MemoryStore started with capacity 413.9 MB 2018-06-17 11:19:39 INFO SparkEnv:54 - Registering OutputCommitCoordinator 2018-06-17 11:19:40 INFO log:192 - Logging initialized @7035ms 2018-06-17 11:19:40 INFO Server:346 - jetty-9.3.z-SNAPSHOT 2018-06-17 11:19:40 INFO Server:414 - Started @7383ms 2018-06-17 11:19:40 INFO AbstractConnector:278 - Started ServerConnector@51ad75c2{HTTP/1.1,[http/1.1]}{0.0.0.0:12678} 2018-06-17 11:19:40 INFO Utils:54 - Successfully started service 'SparkUI' on port 12678. 
2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@50b8ae8d{/jobs,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@60afd40d{/jobs/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@28a2a3e7{/jobs/job,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@10b3df93{/jobs/job/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@ea27e34{/stages,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@33a2499c{/stages/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@e72dba7{/stages/stage,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c321bdb {/stages/stage/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24855019{/stages/pool,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3abd581e {/stages/pool/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4d4d8fcf{/storage,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@610db97e{/storage/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f0628de{/storage/rdd,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3fabf088 {/storage/rdd/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1e392345{/environment,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@12f3afb5 {/environment/json,null,AVAILABLE,@Spark} 2018-06-17 11:19:40 INFO ContextHandler:781 - Started
Re: [Help] Codegen Stage grows beyond 64 KB
Totally agreed with Eyal. The problem is that when the Java code Catalyst generates for DataFrame and Dataset programs is compiled to bytecode, no single method may be 64 KB or larger; exceeding this Java class-file limit is what raises the exception. To avoid hitting this restriction, Spark already tries to split generated methods that are likely to exceed 64 KB into multiple smaller methods when Catalyst emits the Java code. Use persist or any other logical separation in the pipeline. Regards, Vaquar khan On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny wrote: > Hi Akash, > such errors might appear in large spark pipelines, the root cause is a > 64kb jvm limitation. > the reason that your job isn't failing at the end is due to spark fallback > - if code gen is failing, spark compiler will try to create the flow > without the code gen (less optimized) > if you do not want to see this error, you can either disable code gen > using the flag: spark.sql.codegen.wholeStage= "false" > or you can try to split your complex pipeline into several spark flows if > possible > > hope that helps > > Eyal > > On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu > wrote: > >> Hi, >> >> I already went through it, that's one use case. I've a complex and very >> big pipeline of multiple jobs under one spark session. Not getting, on how >> to solve this, as it is happening over Logistic Regression and Random >> Forest models, which I'm just using from Spark ML package rather than doing >> anything by myself. >> >> Thanks, >> Aakash. >> >> On Sun 17 Jun, 2018, 8:21 AM vaquar khan, wrote: >> >>> Hi Akash, >>> >>> Please check stackoverflow. >>> >>> https://stackoverflow.com/questions/41098953/codegen-grows- >>> beyond-64-kb-error-when-normalizing-large-pyspark-dataframe >>> >>> Regards, >>> Vaquar khan >>> >>> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu >> > wrote: Hi guys, I'm getting an error when I'm feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how can I avoid this. Spark - 2.3.1 Python - 3.6 Cluster Config - 1 Master - 32 GB RAM, 16 Cores 4 Slaves - 16 GB RAM, 8 Cores Input data - 8 partitions of parquet file with snappy compression. My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=2 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_ 33_col.txt Stack-Trace below - ERROR CodeGenerator:91 - failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge > neratedIteratorForCodegenStage3426" grows beyond 64 KB > org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge > neratedIteratorForCodegenStage3426" grows beyond 64 KB > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.
> java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java: > 234) > at org.codehaus.janino.SimpleCompiler.compileToClassLoader(Simp > leCompiler.java:446) > at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassB > odyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluat > or.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java: > 204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera > tor$.org$apache$spark$sql$catalyst$expressions$codegen$C > odeGenerator$$doCompile(CodeGenerator.scala:1417) > at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera > tor$$anon$1.load(CodeGenerator.scala:1493) > at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera > tor$$anon$1.load(CodeGenerator.scala:1490) > at org.spark_project.guava.cache.LocalCache$LoadingValueReferen > ce.loadFuture(LocalCache.java:3599) > at org.spark_project.guava.cache.LocalCache$Segment.loadSync(Lo > calCache.java:2379) > at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOr >
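As a concrete illustration of the workaround mentioned in the thread, here is a hedged Scala sketch that disables whole-stage code generation for a session; the same flag can also be passed to spark-submit with --conf. This trades some performance for avoiding the 64 KB method-size limit and is not a recommendation for every workload:

    import org.apache.spark.sql.SparkSession

    // With whole-stage codegen off, Catalyst no longer emits the single large
    // processNext() method that can grow beyond the JVM's 64 KB limit.
    val spark = SparkSession.builder()
      .appName("feature-pipeline")                        // name is illustrative
      .config("spark.sql.codegen.wholeStage", "false")
      .getOrCreate()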
Re: Error: Could not find or load main class org.apache.spark.launcher.Main
Thank you Vamshi, Yes the path presumably has been added, here it is: rxie@ubuntu:~/Downloads/spark$ echo $PATH /home/rxie/Downloads/spark :/usr/bin/java:/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java rxie@ubuntu:~/Downloads/spark$ spark-shell Error: Could not find or load main class org.apache.spark.launcher.Main ** *Sincerely yours,* *Raymond* On Sun, Jun 17, 2018 at 8:44 AM, Vamshi Talla wrote: > Raymond, > > > Is your SPARK_HOME set? In your .bash_profile, try setting the below: > > export SPARK_HOME=/home/Downloads/spark (or wherever your spark is > downloaded to) > > once done, source your .bash_profile or restart the shell and try > spark-shell > > > Best Regards, > > Vamshi T > > > -- > *From:* Raymond Xie > *Sent:* Sunday, June 17, 2018 6:27 AM > *To:* user; Hui Xie > *Subject:* Error: Could not find or load main class > org.apache.spark.launcher.Main > > Hello, > > It would be really appreciated if anyone can help sort it out the > following path issue for me? I highly doubt this is related to missing path > setting but don't know how can I fix it. > > > > rxie@ubuntu:~/Downloads/spark$ echo $PATH > /usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/ > bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/s > > park:/home/rxie/Downloads/spark/bin:/usr/bin/java > rxie@ubuntu:~/Downloads/spark$ pyspark > Error: Could not find or load main class org.apache.spark.launcher.Main > rxie@ubuntu:~/Downloads/spark$ spark-shell > Error: Could not find or load main class org.apache.spark.launcher.Main > rxie@ubuntu:~/Downloads/spark$ pwd > /home/rxie/Downloads/spark > rxie@ubuntu:~/Downloads/spark$ ls > bin conf data examples jars kubernetes licenses R yarn > > > ** > *Sincerely yours,* > > > *Raymond* >
Re: Error: Could not find or load main class org.apache.spark.launcher.Main
Raymond, Is your SPARK_HOME set? In your .bash_profile, try setting the below: export SPARK_HOME=/home/Downloads/spark (or wherever your spark is downloaded to) once done, source your .bash_profile or restart the shell and try spark-shell Best Regards, Vamshi T From: Raymond Xie Sent: Sunday, June 17, 2018 6:27 AM To: user; Hui Xie Subject: Error: Could not find or load main class org.apache.spark.launcher.Main Hello, It would be really appreciated if anyone can help sort it out the following path issue for me? I highly doubt this is related to missing path setting but don't know how can I fix it. rxie@ubuntu:~/Downloads/spark$ echo $PATH /usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/s park:/home/rxie/Downloads/spark/bin:/usr/bin/java rxie@ubuntu:~/Downloads/spark$ pyspark Error: Could not find or load main class org.apache.spark.launcher.Main rxie@ubuntu:~/Downloads/spark$ spark-shell Error: Could not find or load main class org.apache.spark.launcher.Main rxie@ubuntu:~/Downloads/spark$ pwd /home/rxie/Downloads/spark rxie@ubuntu:~/Downloads/spark$ ls bin conf data examples jars kubernetes licenses R yarn Sincerely yours, Raymond
Error: Could not find or load main class org.apache.spark.launcher.Main
Hello, It would be really appreciated if anyone could help me sort out the following path issue. I highly doubt this is related to a missing path setting, but I don't know how I can fix it. rxie@ubuntu:~/Downloads/spark$ echo $PATH /usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java rxie@ubuntu:~/Downloads/spark$ pyspark Error: Could not find or load main class org.apache.spark.launcher.Main rxie@ubuntu:~/Downloads/spark$ spark-shell Error: Could not find or load main class org.apache.spark.launcher.Main rxie@ubuntu:~/Downloads/spark$ pwd /home/rxie/Downloads/spark rxie@ubuntu:~/Downloads/spark$ ls bin conf data examples jars kubernetes licenses R yarn Sincerely yours, Raymond
spark-shell doesn't start
Hello, I am doing the practice in Ubuntu now; here is the error I am encountering: rxie@ubuntu:~/Downloads/spark/bin$ spark-shell Error: Could not find or load main class org.apache.spark.launcher.Main What am I missing? Thank you very much. Java is installed. Sincerely yours, Raymond
Re: spark-submit Error: Cannot load main class from JAR file
Hi Raymond, I see that you can make a small correction in your spark-submit command. Your spark-submit command should say: spark-submit --master local --class <package name>.<class name> <Jar Location and JarName> <arguments> Example: spark-submit --master local \ --class retail_db.GetRevenuePerOrder C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar \ C:\RXIE\Learning\Data\data-master\retail_db\order_items \ C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order Also, I believe you can use "\" to split the command across multiple lines even on Windows. Best Regards, Vamshi T From: Raymond Xie Sent: Sunday, June 17, 2018 5:07 AM To: user; Hui Xie Subject: spark-submit Error: Cannot load main class from JAR file Hello, I am doing the practice in windows now. I have the jar file generated under: C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar The package name is Retail_db and the object is GetRevenuePerOrder. The spark-submit command is: spark-submit retail_db.GetRevenuePerOrder --class "C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar" local C:\RXIE\Learning\Data\data-master\retail_db\order_items C:\C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order When executing this command, it throws out the error: Error: Cannot load main class from JAR file What is the right submit command should I write here? Thank you very much. By the way, the command is really too long, how can I split it into multiple lines like '\' in Linux? Sincerely yours, Raymond
spark-submit Error: Cannot load main class from JAR file
Hello, I am doing the practice in Windows now. I have the jar file generated under: C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar The package name is Retail_db and the object is GetRevenuePerOrder. The spark-submit command is: spark-submit retail_db.GetRevenuePerOrder --class "C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar" local C:\RXIE\Learning\Data\data-master\retail_db\order_items C:\C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order When executing this command, it throws the error: Error: Cannot load main class from JAR file What is the right submit command I should write here? Thank you very much. By the way, the command is really long; how can I split it into multiple lines like with '\' in Linux? Sincerely yours, Raymond
Re: [Help] Codegen Stage grows beyond 64 KB
Hi Akash, such errors might appear in large spark pipelines, the root cause is a 64kb jvm limitation. the reason that your job isn't failing at the end is due to spark fallback - if code gen is failing, spark compiler will try to create the flow without the code gen (less optimized) if you do not want to see this error, you can either disable code gen using the flag: spark.sql.codegen.wholeStage= "false" or you can try to split your complex pipeline into several spark flows if possible hope that helps Eyal On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu wrote: > Hi, > > I already went through it, that's one use case. I've a complex and very > big pipeline of multiple jobs under one spark session. Not getting, on how > to solve this, as it is happening over Logistic Regression and Random > Forest models, which I'm just using from Spark ML package rather than doing > anything by myself. > > Thanks, > Aakash. > > On Sun 17 Jun, 2018, 8:21 AM vaquar khan, wrote: > >> Hi Akash, >> >> Please check stackoverflow. >> >> https://stackoverflow.com/questions/41098953/codegen- >> grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe >> >> Regards, >> Vaquar khan >> >> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu >> wrote: >> >>> Hi guys, >>> >>> I'm getting an error when I'm feature engineering on 30+ columns to >>> create about 200+ columns. It is not failing the job, but the ERROR shows. >>> I want to know how can I avoid this. >>> >>> Spark - 2.3.1 >>> Python - 3.6 >>> >>> Cluster Config - >>> 1 Master - 32 GB RAM, 16 Cores >>> 4 Slaves - 16 GB RAM, 8 Cores >>> >>> >>> Input data - 8 partitions of parquet file with snappy compression. >>> >>> My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 >>> --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 >>> --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf >>> spark.driver.maxResultSize=2G --conf "spark.executor. >>> extraJavaOptions=-XX:+UseParallelGC" --conf >>> spark.scheduler.listenerbus.eventqueue.capacity=2 >>> --conf spark.sql.codegen=true >>> /appdata/bblite-codebase/pipeline_data_test_run.py >>> > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt >>> >>> Stack-Trace below - >>> >>> ERROR CodeGenerator:91 - failed to compile: >>> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$ GeneratedIteratorForCodegenStage3426" grows beyond 64 KB org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$ GeneratedIteratorForCodegenStage3426" grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compileUnit( UnitCompiler.java:361) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) at org.codehaus.janino.SimpleCompiler.compileToClassLoader( SimpleCompiler.java:446) at org.codehaus.janino.ClassBodyEvaluator.compileToClass( ClassBodyEvaluator.java:313) at org.codehaus.janino.ClassBodyEvaluator.cook( ClassBodyEvaluator.java:235) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at org.apache.spark.sql.catalyst.expressions.codegen. CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$ CodeGenerator$$doCompile(CodeGenerator.scala:1417) at org.apache.spark.sql.catalyst.expressions.codegen. 
CodeGenerator$$anon$1.load(CodeGenerator.scala:1493) at org.apache.spark.sql.catalyst.expressions.codegen. CodeGenerator$$anon$1.load(CodeGenerator.scala:1490) at org.spark_project.guava.cache.LocalCache$LoadingValueReference. loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync( LocalCache.java:2379) at org.spark_project.guava.cache.LocalCache$Segment. lockedGetOrLoad(LocalCache.java:2342) at org.spark_project.guava.cache.LocalCache$Segment.get( LocalCache.java:2257) at org.spark_project.guava.cache.LocalCache.get(LocalCache. java:4000) at org.spark_project.guava.cache.LocalCache.getOrLoad( LocalCache.java:4004) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache. get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen. CodeGenerator$.compile(CodeGenerator.scala:1365) at org.apache.spark.sql.execution.WholeStageCodegenExec. liftedTree1$1(WholeStageCodegenExec.scala:579) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute( WholeStageCodegenExec.scala:578) at org.apache.spark.sql.execution.SparkPlan$$anonfun$ execute$1.apply(SparkPlan.scala:131)