[jira] [Commented] (SPARK-24109) Remove class SnappyOutputStreamWrapper
[ https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456000#comment-16456000 ] Takeshi Yamamuro commented on SPARK-24109: -- IMO it'd be better to keep this ticket open because the wrapper can be removed in the future, but not now. For related discussion, see: https://github.com/apache/spark/pull/18949#issuecomment-323354674 > Remove class SnappyOutputStreamWrapper > -- > > Key: SPARK-24109 > URL: https://issues.apache.org/jira/browse/SPARK-24109 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Minor > Fix For: 2.4.0 > > > Wrapper over `SnappyOutputStream` which guards against write-after-close and > double-close > issues. See SPARK-7660 for more details. > This wrapping can be removed if we upgrade to a version > of snappy-java that contains the fix for > [https://github.com/xerial/snappy-java/issues/107.] > {{snappy-java:1.1.2+ fixed the bug}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
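For readers of this digest, a minimal sketch of the guard idea the wrapper provides. This is not Spark's actual SnappyOutputStreamWrapper source, just an illustration of the write-after-close and double-close protection being discussed; the class and method bodies are assumptions.

{code:scala}
import java.io.{IOException, OutputStream}

// Illustrative only: a delegate that refuses writes after close and tolerates double-close.
class GuardedOutputStream(out: OutputStream) extends OutputStream {
  private var closed = false

  override def write(b: Int): Unit = {
    if (closed) throw new IOException("Stream is closed")  // write-after-close guard
    out.write(b)
  }

  override def flush(): Unit = {
    if (!closed) out.flush()
  }

  override def close(): Unit = {
    if (!closed) {          // double-close guard: only the first close reaches the delegate
      closed = true
      out.close()
    }
  }
}
{code}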
[jira] [Commented] (SPARK-24086) Exception while executing spark streaming examples
[ https://issues.apache.org/jira/browse/SPARK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456003#comment-16456003 ] Chandra Hasan commented on SPARK-24086: --- [~hyukjin.kwon] Thanks, mate. I included the necessary dependencies when executing and it is working now. If someone is facing the same issue, here is the solution: {code:java} spark-submit --jars kafka-clients-1.1.0.jar,spark-streaming_2.11-2.3.0.jar,spark-streaming-kafka-0-10_2.11-2.3.0.jar --class org.apache.spark.examples.streaming.JavaDirectKafkaWordCount target/original-spark-examples_2.11-2.4.0-SNAPSHOT.jar {code} Also, [~hyukjin.kwon], I would like to point out that the consumer properties in the JavaDirectKafkaWordCount example are not up to date, which causes a missing-configuration error, and I had to rewrite that part of the code as below: {code:java} kafkaParams.put("bootstrap.servers", brokers); kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); kafkaParams.put("group.id", "");{code} What do you say: is this fine, or should I open a bug for it? > Exception while executing spark streaming examples > -- > > Key: SPARK-24086 > URL: https://issues.apache.org/jira/browse/SPARK-24086 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Chandra Hasan >Priority: Major > > After running mvn clean package, I tried to execute one of the Spark example > programs, JavaDirectKafkaWordCount.java, but it throws the following exception. > {code:java} > [cloud-user@server-2 examples]$ run-example > streaming.JavaDirectKafkaWordCount 192.168.0.4:9092 msu > 2018-04-25 09:39:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-04-25 09:39:22 INFO SparkContext:54 - Running Spark version 2.3.0 > 2018-04-25 09:39:22 INFO SparkContext:54 - Submitted application: > JavaDirectKafkaWordCount > 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls to: > cloud-user > 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls to: > cloud-user > 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls groups to: > 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls groups to: > 2018-04-25 09:39:22 INFO SecurityManager:54 - SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(cloud-user); > groups with view permissions: Set(); users with modify permissions: > Set(cloud-user); groups with modify permissions: Set() > 2018-04-25 09:39:23 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 59333. 
> 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering MapOutputTracker > 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering BlockManagerMaster > 2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2018-04-25 09:39:23 INFO DiskBlockManager:54 - Created local directory at > /tmp/blockmgr-6fc11fc1-f638-42ea-a9df-dc01fb81b7b6 > 2018-04-25 09:39:23 INFO MemoryStore:54 - MemoryStore started with capacity > 366.3 MB > 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2018-04-25 09:39:23 INFO log:192 - Logging initialized @1825ms > 2018-04-25 09:39:23 INFO Server:346 - jetty-9.3.z-SNAPSHOT > 2018-04-25 09:39:23 INFO Server:414 - Started @1900ms > 2018-04-25 09:39:23 INFO AbstractConnector:278 - Started > ServerConnector@6813a331{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2018-04-25 09:39:23 INFO Utils:54 - Successfully started service 'SparkUI' on > port 4040. > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@4f7c0be3{/jobs,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@4cfbaf4{/jobs/json,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@58faa93b{/jobs/job,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@127d7908{/jobs/job/json,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@6b9c69a9{/stages,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@6622a690{/stages/json,null,AVAILABLE,@Spark} > 2018-04-25 09:39:23 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@30b9eadd{/stages/stage,null,AVAILABLE,@Spar
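As a companion to the comment above, here is a hedged Scala sketch of how the corrected consumer properties are typically wired into a Kafka 0-10 direct stream in Spark 2.3. The broker and topic values are taken from the reproduction log; the non-empty group.id, app name, and batch interval are illustrative assumptions, not values from the example.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("DirectKafkaWordCountSketch")  // master supplied by spark-submit
val ssc = new StreamingContext(conf, Seconds(2))

val brokers = "192.168.0.4:9092"   // from the reproduction above
val topics = Set("msu")
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-consumer-group"  // assumption; the snippet above used an empty string
)

// Direct stream using the updated properties; print the message values of each batch.
val lines = KafkaUtils
  .createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
  .map(_.value())

lines.print()
ssc.start()
ssc.awaitTermination()
{code}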
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455734#comment-16455734 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Thanks. Yeah.. i will have two separate PRs. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
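While the PRs are pending, the rewrites from the issue description can already be run by hand. A sketch, assuming a live SparkSession named spark and that tab1 and tab2 are registered views with columns a, b, c (the queries themselves are the ones proposed above, which the reviewers report as working):

{code:scala}
// EXCEPT ALL rewritten as the proposed left outer join, registered for reuse.
val t1ExceptT2 = spark.sql("""
  SELECT t1.a, t1.b, t1.c
  FROM tab1 t1
  LEFT OUTER JOIN tab2 t2
    ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
  WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
""")
t1ExceptT2.createOrReplaceTempView("t1_except_t2_df")

// INTERSECT ALL, per the description, is then an anti-join against that view.
val intersectAll = spark.sql("""
  SELECT a, b, c
  FROM tab1 t1
  WHERE NOT EXISTS (
    SELECT 1 FROM t1_except_t2_df e
    WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
  )
""")
{code}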
[jira] [Assigned] (SPARK-24109) Remove class SnappyOutputStreamWrapper
[ https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24109: Assignee: (was: Apache Spark) > Remove class SnappyOutputStreamWrapper > -- > > Key: SPARK-24109 > URL: https://issues.apache.org/jira/browse/SPARK-24109 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Minor > Fix For: 2.4.0 > > > Wrapper over `SnappyOutputStream` which guards against write-after-close and > double-close > issues. See SPARK-7660 for more details. > This wrapping can be removed if we upgrade to a version > of snappy-java that contains the fix for > [https://github.com/xerial/snappy-java/issues/107.] > {{snappy-java:1.1.2+ fixed the bug}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24109) Remove class SnappyOutputStreamWrapper
[ https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455731#comment-16455731 ] Apache Spark commented on SPARK-24109: -- User 'manbuyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21176 > Remove class SnappyOutputStreamWrapper > -- > > Key: SPARK-24109 > URL: https://issues.apache.org/jira/browse/SPARK-24109 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Minor > Fix For: 2.4.0 > > > Wrapper over `SnappyOutputStream` which guards against write-after-close and > double-close > issues. See SPARK-7660 for more details. > This wrapping can be removed if we upgrade to a version > of snappy-java that contains the fix for > [https://github.com/xerial/snappy-java/issues/107.] > {{snappy-java:1.1.2+ fixed the bug}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24109) Remove class SnappyOutputStreamWrapper
[ https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24109: Assignee: Apache Spark > Remove class SnappyOutputStreamWrapper > -- > > Key: SPARK-24109 > URL: https://issues.apache.org/jira/browse/SPARK-24109 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Assignee: Apache Spark >Priority: Minor > Fix For: 2.4.0 > > > Wrapper over `SnappyOutputStream` which guards against write-after-close and > double-close > issues. See SPARK-7660 for more details. > This wrapping can be removed if we upgrade to a version > of snappy-java that contains the fix for > [https://github.com/xerial/snappy-java/issues/107.] > {{snappy-java:1.1.2+ fixed the bug}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24109) Remove class SnappyOutputStreamWrapper
wangjinhai created SPARK-24109: -- Summary: Remove class SnappyOutputStreamWrapper Key: SPARK-24109 URL: https://issues.apache.org/jira/browse/SPARK-24109 Project: Spark Issue Type: Improvement Components: Block Manager, Input/Output Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: wangjinhai Fix For: 2.4.0 Wrapper over `SnappyOutputStream` which guards against write-after-close and double-close issues. See SPARK-7660 for more details. This wrapping can be removed if we upgrade to a version of snappy-java that contains the fix for [https://github.com/xerial/snappy-java/issues/107.] {{snappy-java:1.1.2+ fixed the bug}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455716#comment-16455716 ] Takeshi Yamamuro commented on SPARK-21274: -- ok, thanks! IMO it'd be better to make separate two prs to implement `EXCEPT ALL` and `INTERSECT ALL`. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24108) ChunkedByteBuffer.writeFully method has not reset the limit value
[ https://issues.apache.org/jira/browse/SPARK-24108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangjinhai resolved SPARK-24108. Resolution: Won't Do > ChunkedByteBuffer.writeFully method has not reset the limit value > - > > Key: SPARK-24108 > URL: https://issues.apache.org/jira/browse/SPARK-24108 > Project: Spark > Issue Type: Bug > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Major > Fix For: 2.4.0 > > > ChunkedByteBuffer.writeFully method has not reset the limit value. When > chunks larger than bufferWriteChunkSize, such as 80*1024*1024 larger than > config.BUFFER_WRITE_CHUNK_SIZE(64 * 1024 * 1024),only while once, will lost > 16*1024*1024 byte -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value
[ https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24107: Assignee: Apache Spark > ChunkedByteBuffer.writeFully method has not reset the limit value > - > > Key: SPARK-24107 > URL: https://issues.apache.org/jira/browse/SPARK-24107 > Project: Spark > Issue Type: Bug > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > ChunkedByteBuffer.writeFully method has not reset the limit value. When > chunks larger than bufferWriteChunkSize, such as 80*1024*1024 larger than > config.BUFFER_WRITE_CHUNK_SIZE(64 * 1024 * 1024),only while once, will lost > 16*1024*1024 byte -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value
[ https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24107: Assignee: (was: Apache Spark) > ChunkedByteBuffer.writeFully method has not reset the limit value > - > > Key: SPARK-24107 > URL: https://issues.apache.org/jira/browse/SPARK-24107 > Project: Spark > Issue Type: Bug > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Major > Fix For: 2.4.0 > > > ChunkedByteBuffer.writeFully method has not reset the limit value. When > chunks larger than bufferWriteChunkSize, such as 80*1024*1024 larger than > config.BUFFER_WRITE_CHUNK_SIZE(64 * 1024 * 1024),only while once, will lost > 16*1024*1024 byte -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value
[ https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455700#comment-16455700 ] Apache Spark commented on SPARK-24107: -- User 'manbuyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21175 > ChunkedByteBuffer.writeFully method has not reset the limit value > - > > Key: SPARK-24107 > URL: https://issues.apache.org/jira/browse/SPARK-24107 > Project: Spark > Issue Type: Bug > Components: Block Manager, Input/Output >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: wangjinhai >Priority: Major > Fix For: 2.4.0 > > > ChunkedByteBuffer.writeFully method has not reset the limit value. When > chunks larger than bufferWriteChunkSize, such as 80*1024*1024 larger than > config.BUFFER_WRITE_CHUNK_SIZE(64 * 1024 * 1024),only while once, will lost > 16*1024*1024 byte -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455697#comment-16455697 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] I am currently testing the code. I will open the PRs as soon as i finish. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24108) ChunkedByteBuffer.writeFully method has not reset the limit value
wangjinhai created SPARK-24108: -- Summary: ChunkedByteBuffer.writeFully method has not reset the limit value Key: SPARK-24108 URL: https://issues.apache.org/jira/browse/SPARK-24108 Project: Spark Issue Type: Bug Components: Block Manager, Input/Output Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: wangjinhai Fix For: 2.4.0 The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the while loop runs only once and the remaining 16*1024*1024 bytes are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value
wangjinhai created SPARK-24107: -- Summary: ChunkedByteBuffer.writeFully method has not reset the limit value Key: SPARK-24107 URL: https://issues.apache.org/jira/browse/SPARK-24107 Project: Spark Issue Type: Bug Components: Block Manager, Input/Output Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: wangjinhai Fix For: 2.4.0 The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the while loop runs only once and the remaining 16*1024*1024 bytes are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
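A hedged, standalone sketch of the behaviour described above (not Spark's actual ChunkedByteBuffer code): if the saved limit is not restored after each bounded write, remaining() drops to zero after the first slice and the rest of the buffer is silently dropped. The method name and parameters here are illustrative, with bufferWriteChunkSize standing in for config.BUFFER_WRITE_CHUNK_SIZE.

{code:scala}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

object WriteFullySketch {
  // Write one ByteBuffer to a channel in slices no larger than bufferWriteChunkSize.
  def writeFully(channel: WritableByteChannel, bytes: ByteBuffer, bufferWriteChunkSize: Int): Unit = {
    val originalLimit = bytes.limit()
    while (bytes.remaining() > 0) {
      val ioSize = math.min(bytes.remaining(), bufferWriteChunkSize)
      bytes.limit(bytes.position() + ioSize)  // bound this write to one slice
      channel.write(bytes)
      bytes.limit(originalLimit)              // restore the limit so the loop sees the remaining bytes
    }
  }
}
{code}

Dropping the final limit restore is exactly the kind of truncation the ticket describes: an 80 MB chunk with a 64 MB write size would lose the trailing 16 MB.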
[jira] [Resolved] (SPARK-23355) convertMetastore should not ignore table properties
[ https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23355. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20522 [https://github.com/apache/spark/pull/20522] > convertMetastore should not ignore table properties > --- > > Key: SPARK-23355 > URL: https://issues.apache.org/jira/browse/SPARK-23355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > `convertMetastoreOrc/Parquet` ignores table properties. > This happens for a table created by `STORED AS ORC/PARQUET`. > Currently, `convertMetastoreParquet` is `true` by default. So, it's a > well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` > and SPARK-22279 tried to turn on that for Feature Parity with Parquet. > However, it's indeed a regression for the existing Hive ORC table users. So, > it'll be reverted for Apache Spark 2.3 via > https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
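A hedged illustration of the scenario this ticket covers, with assumed table name, property name, and a live SparkSession named spark: a Hive ORC table whose TBLPROPERTIES (here a compression setting) should still be honored when convertMetastoreOrc routes reads and writes through Spark's native ORC path.

{code:scala}
// Create a Hive ORC table with an explicit table property.
spark.sql("""
  CREATE TABLE orc_props_demo (id INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")

// With conversion enabled, the write below should still respect 'orc.compress'.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("INSERT INTO orc_props_demo VALUES (1)")
{code}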
[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties
[ https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23355: --- Assignee: Dongjoon Hyun > convertMetastore should not ignore table properties > --- > > Key: SPARK-23355 > URL: https://issues.apache.org/jira/browse/SPARK-23355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > `convertMetastoreOrc/Parquet` ignores table properties. > This happens for a table created by `STORED AS ORC/PARQUET`. > Currently, `convertMetastoreParquet` is `true` by default. So, it's a > well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` > and SPARK-22279 tried to turn on that for Feature Parity with Parquet. > However, it's indeed a regression for the existing Hive ORC table users. So, > it'll be reverted for Apache Spark 2.3 via > https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
[ https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caijie updated SPARK-24100: --- Component/s: (was: DStreams) > Add the CompressionCodec to the saveAsTextFiles interface. > -- > > Key: SPARK-24100 > URL: https://issues.apache.org/jira/browse/SPARK-24100 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: caijie >Priority: Minor > > Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
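Until such an option exists, a hedged Scala sketch of the usual workaround (the request itself targets the PySpark saveAsTextFiles API): write each batch through RDD.saveAsTextFile, which already accepts a compression codec. The function name, prefix/suffix parameters, and choice of GzipCodec are assumptions for illustration.

{code:scala}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Save each batch of `lines` as compressed text, mirroring saveAsTextFiles' naming scheme.
def saveCompressed(lines: DStream[String], prefix: String, suffix: String): Unit = {
  lines.foreachRDD { (rdd, time: Time) =>
    rdd.saveAsTextFile(s"$prefix-${time.milliseconds}.$suffix", classOf[GzipCodec])
  }
}
{code}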
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455621#comment-16455621 ] Saisai Shao commented on SPARK-23830: - I agree [~emaynard]. > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
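A hedged sketch of the extra check suggested at the end of the description above. It is not the actual patch; the names mirror the quoted ApplicationMaster snippet but are passed in as parameters so the sketch stands alone.

{code:scala}
import java.lang.reflect.Modifier

def invokeUserMain(userClassLoader: ClassLoader, userClass: String, userArgs: Seq[String]): Unit = {
  val mainMethod = userClassLoader.loadClass(userClass)
    .getMethod("main", classOf[Array[String]])

  // Fail fast with a clear message instead of an NPE when main is not static,
  // i.e. when the entry point is a Scala class rather than an object.
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException(
      s"$userClass does not define a static main method; " +
        "declare the Spark application entry point in an object, not a class.")
  }
  mainMethod.invoke(null, userArgs.toArray)
}
{code}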
[jira] [Resolved] (SPARK-24099) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data in JSON
[ https://issues.apache.org/jira/browse/SPARK-24099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24099. -- Resolution: Duplicate It will be fixed in SPARK-23723 soon. > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data in JSON > --- > > Key: SPARK-24099 > URL: https://issues.apache.org/jira/browse/SPARK-24099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 > >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > # Select * from json; > Throws below Exception > Caused by: *java.io.CharConversionException: Invalid UTF-32 character* > 0x151a15(above 10) at char #1, byte #7) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577) > at > org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:350) > at > org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:347) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585) > at > org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:347) > at > org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126) > at > org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61) > at > org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130) > at > org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > Note: > https://issues.apache.org/jira/browse/SPARK-16548 Jira raised in 1.6 and said > Fixed in 2.3 but still I am getting same Error. > Please update on this. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455611#comment-16455611 ] Hyukjin Kwon edited comment on SPARK-24068 at 4/27/18 12:51 AM: I roughly assume the fix will be small, similar or the same? I think it's fine to describe both issues here. was (Author: hyukjin.kwon): I roughly assume the fix will be small, similar or the same? I think it's fix to describe both issues here. > CSV schema inferring doesn't work for compressed files > -- > > Key: SPARK-24068 > URL: https://issues.apache.org/jira/browse/SPARK-24068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Here is a simple csv file compressed by lzo > {code} > $ cat ./test.csv > col1,col2 > a,1 > $ lzop ./test.csv > $ ls > test.csv test.csv.lzo > {code} > Reading test.csv.lzo with LZO codec (see > https://github.com/twitter/hadoop-lzo, for example): > {code:scala} > scala> val ds = spark.read.option("header", true).option("inferSchema", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") > ds: org.apache.spark.sql.DataFrame = [�LZO?: string] > scala> ds.printSchema > root > |-- �LZO: string (nullable = true) > scala> ds.show > +-+ > |�LZO| > +-+ > |a| > +-+ > {code} > but the file can be read if the schema is specified: > {code} > scala> import org.apache.spark.sql.types._ > scala> val schema = new StructType().add("col1", StringType).add("col2", > IntegerType) > scala> val ds = spark.read.schema(schema).option("header", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") > scala> ds.show > +++ > |col1|col2| > +++ > | a| 1| > +++ > {code} > Just in case, schema inferring works for the original uncompressed file: > {code:scala} > scala> spark.read.option("header", true).option("inferSchema", > true).csv("test.csv").printSchema > root > |-- col1: string (nullable = true) > |-- col2: integer (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455611#comment-16455611 ] Hyukjin Kwon commented on SPARK-24068: -- I roughly assume the fix will be small, similar or the same? I think it's fix to describe both issues here. > CSV schema inferring doesn't work for compressed files > -- > > Key: SPARK-24068 > URL: https://issues.apache.org/jira/browse/SPARK-24068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Here is a simple csv file compressed by lzo > {code} > $ cat ./test.csv > col1,col2 > a,1 > $ lzop ./test.csv > $ ls > test.csv test.csv.lzo > {code} > Reading test.csv.lzo with LZO codec (see > https://github.com/twitter/hadoop-lzo, for example): > {code:scala} > scala> val ds = spark.read.option("header", true).option("inferSchema", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") > ds: org.apache.spark.sql.DataFrame = [�LZO?: string] > scala> ds.printSchema > root > |-- �LZO: string (nullable = true) > scala> ds.show > +-+ > |�LZO| > +-+ > |a| > +-+ > {code} > but the file can be read if the schema is specified: > {code} > scala> import org.apache.spark.sql.types._ > scala> val schema = new StructType().add("col1", StringType).add("col2", > IntegerType) > scala> val ds = spark.read.schema(schema).option("header", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") > scala> ds.show > +++ > |col1|col2| > +++ > | a| 1| > +++ > {code} > Just in case, schema inferring works for the original uncompressed file: > {code:scala} > scala> spark.read.option("header", true).option("inferSchema", > true).csv("test.csv").printSchema > root > |-- col1: string (nullable = true) > |-- col2: integer (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23925) High-order function: repeat(element, count) → array
[ https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455506#comment-16455506 ] Florent Pepin commented on SPARK-23925: --- Hey, sorry I didn't realise the complexity of this one, mainly because of the existing repeat function that takes in a String and outputs a String. I've first created an array specific function that works fine, now I'm trying to see how I can consolidate that with the existing String one into one function. > High-order function: repeat(element, count) → array > --- > > Key: SPARK-23925 > URL: https://issues.apache.org/jira/browse/SPARK-23925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Repeat element for count times. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
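For anyone wanting the behaviour before a built-in lands, a hedged sketch of the intended semantics as a plain Scala UDF. This is not the Catalyst expression the ticket asks for, it fixes the element type to String for simplicity, and the session setup is assumed.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("repeat-sketch").getOrCreate()

// repeat(element, count) -> array, following the Presto reference in the description.
val repeatElement = udf((element: String, count: Int) => Seq.fill(math.max(count, 0))(element))

spark.range(1)
  .select(repeatElement(lit("x"), lit(3)).as("repeated"))
  .show()  // a single row containing [x, x, x]
{code}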
[jira] [Assigned] (SPARK-24083) Diagnostics message for uncaught exceptions should include the stacktrace
[ https://issues.apache.org/jira/browse/SPARK-24083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24083: -- Assignee: zhoukang > Diagnostics message for uncaught exceptions should include the stacktrace > - > > Key: SPARK-24083 > URL: https://issues.apache.org/jira/browse/SPARK-24083 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: zhoukang >Assignee: zhoukang >Priority: Minor > Fix For: 2.4.0 > > > Like [SPARK-23296|https://issues.apache.org/jira/browse/SPARK-23296]. > For uncaught exceptions, we should also print the stacktrace. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24083) Diagnostics message for uncaught exceptions should include the stacktrace
[ https://issues.apache.org/jira/browse/SPARK-24083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24083. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21151 [https://github.com/apache/spark/pull/21151] > Diagnostics message for uncaught exceptions should include the stacktrace > - > > Key: SPARK-24083 > URL: https://issues.apache.org/jira/browse/SPARK-24083 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: zhoukang >Assignee: zhoukang >Priority: Minor > Fix For: 2.4.0 > > > Like [SPARK-23296|https://issues.apache.org/jira/browse/SPARK-23296]. > For uncaught exceptions, we should also print the stacktrace. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455470#comment-16455470 ] Takeshi Yamamuro commented on SPARK-21274: -- Looks great to me. I checked the queries above worked well. Any plan to make a pr? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24085) Scalar subquery error
[ https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24085: Assignee: (was: Apache Spark) > Scalar subquery error > - > > Key: SPARK-24085 > URL: https://issues.apache.org/jira/browse/SPARK-24085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Alexey Baturin >Priority: Major > > Error > {noformat} > SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate > expression: scalar-subquery{noformat} > Then query a partitioed table based on a parquet file then filter by > partition column by scalar subquery. > Query to reproduce: > {code:sql} > CREATE TABLE test_prc_bug ( > id_value string > ) > partitioned by (id_type string) > location '/tmp/test_prc_bug' > stored as parquet; > insert into test_prc_bug values ('1','a'); > insert into test_prc_bug values ('2','a'); > insert into test_prc_bug values ('3','b'); > insert into test_prc_bug values ('4','b'); > select * from test_prc_bug > where id_type = (select 'b'); > {code} > If table in ORC format it works fine -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24085) Scalar subquery error
[ https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24085: Assignee: Apache Spark > Scalar subquery error > - > > Key: SPARK-24085 > URL: https://issues.apache.org/jira/browse/SPARK-24085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Alexey Baturin >Assignee: Apache Spark >Priority: Major > > Error > {noformat} > SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate > expression: scalar-subquery{noformat} > Then query a partitioed table based on a parquet file then filter by > partition column by scalar subquery. > Query to reproduce: > {code:sql} > CREATE TABLE test_prc_bug ( > id_value string > ) > partitioned by (id_type string) > location '/tmp/test_prc_bug' > stored as parquet; > insert into test_prc_bug values ('1','a'); > insert into test_prc_bug values ('2','a'); > insert into test_prc_bug values ('3','b'); > insert into test_prc_bug values ('4','b'); > select * from test_prc_bug > where id_type = (select 'b'); > {code} > If table in ORC format it works fine -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24085) Scalar subquery error
[ https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455468#comment-16455468 ] Apache Spark commented on SPARK-24085: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21174 > Scalar subquery error > - > > Key: SPARK-24085 > URL: https://issues.apache.org/jira/browse/SPARK-24085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Alexey Baturin >Priority: Major > > Error > {noformat} > SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate > expression: scalar-subquery{noformat} > Then query a partitioed table based on a parquet file then filter by > partition column by scalar subquery. > Query to reproduce: > {code:sql} > CREATE TABLE test_prc_bug ( > id_value string > ) > partitioned by (id_type string) > location '/tmp/test_prc_bug' > stored as parquet; > insert into test_prc_bug values ('1','a'); > insert into test_prc_bug values ('2','a'); > insert into test_prc_bug values ('3','b'); > insert into test_prc_bug values ('4','b'); > select * from test_prc_bug > where id_type = (select 'b'); > {code} > If table in ORC format it works fine -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
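Until the linked pull request is merged, one possible workaround consistent with the report above is to evaluate the scalar subquery separately on the driver and pass the result to the partition filter as a plain literal, so partition pruning never sees a subquery expression. A minimal PySpark sketch, assuming the test_prc_bug table from the reproduction already exists and the subquery result is a trusted value (plain string formatting is not safe for untrusted input):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Evaluate the scalar subquery on the driver first...
id_type_value = spark.sql("SELECT 'b'").collect()[0][0]

# ...then substitute the result as a literal in the partition filter.
spark.sql(
    "SELECT * FROM test_prc_bug WHERE id_type = '{}'".format(id_type_value)
).show()
{code}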
[jira] [Resolved] (SPARK-24044) Explicitly print out skipped tests from unittest module
[ https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24044. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21107 [https://github.com/apache/spark/pull/21107] > Explicitly print out skipped tests from unittest module > --- > > Key: SPARK-24044 > URL: https://issues.apache.org/jira/browse/SPARK-24044 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > There was an actual issue, SPARK-23300, and we fixed this by manually > checking if the package is installed. This way needed duplicated codes and > could only check dependencies. There are many conditions, for example, Python > version specific or other packages like NumPy. I think this is something we > should fix. > `unittest` module can print out the skipped messages but these were swallowed > so far in our own testing script. This PR prints out the messages below after > sorted. > It would be nicer if we remove the duplications and print out all the skipped > tests. For example, as below: > This PR proposes to remove duplicated dependency checking logics and also > print out skipped tests from unittests. > For example, as below: > {code} > Skipped tests in pyspark.sql.tests with pypy: > test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) > ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' > test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) > ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' > ... > Skipped tests in pyspark.sql.tests with python3: > test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) > ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' > test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) > ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' > ... > {code} > Actual format can be a bit varied per the discussion in the PR. Please check > out the PR for exact format. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24044) Explicitly print out skipped tests from unittest module
[ https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24044: Assignee: Hyukjin Kwon > Explicitly print out skipped tests from unittest module > --- > > Key: SPARK-24044 > URL: https://issues.apache.org/jira/browse/SPARK-24044 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > There was an actual issue, SPARK-23300, and we fixed this by manually > checking if the package is installed. This way needed duplicated codes and > could only check dependencies. There are many conditions, for example, Python > version specific or other packages like NumPy. I think this is something we > should fix. > `unittest` module can print out the skipped messages but these were swallowed > so far in our own testing script. This PR prints out the messages below after > sorted. > It would be nicer if we remove the duplications and print out all the skipped > tests. For example, as below: > This PR proposes to remove duplicated dependency checking logics and also > print out skipped tests from unittests. > For example, as below: > {code} > Skipped tests in pyspark.sql.tests with pypy: > test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) > ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' > test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) > ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' > ... > Skipped tests in pyspark.sql.tests with python3: > test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) > ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' > test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) > ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' > ... > {code} > Actual format can be a bit varied per the discussion in the PR. Please check > out the PR for exact format. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
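The behaviour being surfaced here is standard {{unittest}} functionality: skip reasons are recorded on the test result and printed by a verbose runner. A small self-contained sketch (the test and package names are illustrative):

{code:python}
import unittest

try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False


class ArrowLikeTests(unittest.TestCase):

    @unittest.skipUnless(have_pyarrow, "PyArrow must be installed; however, it was not found.")
    def test_needs_pyarrow(self):
        self.assertTrue(True)


if __name__ == "__main__":
    # verbosity=2 makes the runner print each skipped test with its reason,
    # which is the information the test script previously swallowed.
    unittest.main(verbosity=2)
{code}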
[jira] [Assigned] (SPARK-23856) Spark jdbc setQueryTimeout option
[ https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23856: Assignee: (was: Apache Spark) > Spark jdbc setQueryTimeout option > - > > Key: SPARK-23856 > URL: https://issues.apache.org/jira/browse/SPARK-23856 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dmitry Mikhailov >Priority: Minor > > It would be nice if a user could set the jdbc setQueryTimeout option when > running jdbc in Spark. I think it can be easily implemented by adding option > field to _JDBCOptions_ class and applying this option when initializing jdbc > statements in _JDBCRDD_ class. But only some DB vendors support this jdbc > feature. Is it worth starting a work on this option? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23856) Spark jdbc setQueryTimeout option
[ https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455342#comment-16455342 ] Apache Spark commented on SPARK-23856: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21173 > Spark jdbc setQueryTimeout option > - > > Key: SPARK-23856 > URL: https://issues.apache.org/jira/browse/SPARK-23856 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dmitry Mikhailov >Priority: Minor > > It would be nice if a user could set the jdbc setQueryTimeout option when > running jdbc in Spark. I think it can be easily implemented by adding option > field to _JDBCOptions_ class and applying this option when initializing jdbc > statements in _JDBCRDD_ class. But only some DB vendors support this jdbc > feature. Is it worth starting a work on this option? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23856) Spark jdbc setQueryTimeout option
[ https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23856: Assignee: Apache Spark > Spark jdbc setQueryTimeout option > - > > Key: SPARK-23856 > URL: https://issues.apache.org/jira/browse/SPARK-23856 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dmitry Mikhailov >Assignee: Apache Spark >Priority: Minor > > It would be nice if a user could set the jdbc setQueryTimeout option when > running jdbc in Spark. I think it can be easily implemented by adding option > field to _JDBCOptions_ class and applying this option when initializing jdbc > statements in _JDBCRDD_ class. But only some DB vendors support this jdbc > feature. Is it worth starting a work on this option? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
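If this lands as a JDBC data source option, usage from PySpark could look roughly like the sketch below. The option name {{queryTimeout}}, its unit (seconds), the connection URL, and the table name are all assumptions for illustration; they mirror java.sql.Statement.setQueryTimeout rather than any committed API.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical option name and values; a reachable database is assumed.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/testdb")  # illustrative URL
      .option("dbtable", "some_table")                         # illustrative table
      .option("queryTimeout", "30")                            # proposed option, in seconds
      .load())
df.show()
{code}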
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454918#comment-16454918 ] Bruce Robbins commented on SPARK-23580: --- Should SortPrefix also get this treatment? > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenario's where were code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fallback. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main area's: > - Add interpreted versions for all dataset related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24057) put the real data type in the AssertionError message
[ https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24057. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21159 [https://github.com/apache/spark/pull/21159] > put the real data type in the AssertionError message > > > Key: SPARK-24057 > URL: https://issues.apache.org/jira/browse/SPARK-24057 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > I had a wrong data type in one of my tests and got the following message: > /spark/python/pyspark/sql/types.py", line 405, in __init__ > assert isinstance(dataType, DataType), "dataType should be DataType" > AssertionError: dataType should be DataType > I checked types.py, line 405, in __init__, it has > {{assert isinstance(dataType, DataType), "dataType should be DataType"}} > I think it meant to be > {{assert isinstance(dataType, DataType), "%s should be %s" % (dataType, > DataType)}} > so the error message will be something like > {{AssertionError: should be 'pyspark.sql.types.DataType'>}} > There are a couple of other places that have the same problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24057) put the real data type in the AssertionError message
[ https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24057: Assignee: Huaxin Gao > put the real data type in the AssertionError message > > > Key: SPARK-24057 > URL: https://issues.apache.org/jira/browse/SPARK-24057 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > I had a wrong data type in one of my tests and got the following message: > /spark/python/pyspark/sql/types.py", line 405, in __init__ > assert isinstance(dataType, DataType), "dataType should be DataType" > AssertionError: dataType should be DataType > I checked types.py, line 405, in __init__, it has > {{assert isinstance(dataType, DataType), "dataType should be DataType"}} > I think it meant to be > {{assert isinstance(dataType, DataType), "%s should be %s" % (dataType, > DataType)}} > so the error message will be something like > {{AssertionError: should be 'pyspark.sql.types.DataType'>}} > There are a couple of other places that have the same problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
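The fix is plain Python string formatting in the assertion message. A standalone sketch of the before and after behaviour, using a stand-in class rather than the real pyspark.sql.types code:

{code:python}
class DataType(object):
    """Stand-in for pyspark.sql.types.DataType, for illustration only."""


def check_current(dataType):
    # Current behaviour: the message never says what was actually passed.
    assert isinstance(dataType, DataType), "dataType should be DataType"


def check_proposed(dataType):
    # Proposed behaviour: include the offending value and the expected type.
    assert isinstance(dataType, DataType), \
        "%s should be an instance of %s" % (dataType, DataType)


try:
    check_proposed("not-a-datatype")
except AssertionError as e:
    print(e)  # e.g. not-a-datatype should be an instance of <class '__main__.DataType'>
{code}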
[jira] [Commented] (SPARK-24105) Spark 2.3.0 on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454847#comment-16454847 ] Anirudh Ramanathan commented on SPARK-24105: > To avoid this deadlock, its required to support node selector (in future > affinity/anti-affinity) configruation by driver & executor. Would inter-pod anti-affinity be a better bet here for this use-case? In the extreme case, this is a gang scheduling issue IMO, where we don't want to schedule drivers if there are no executors that can be scheduled. There's some work on gang scheduling ongoing in https://github.com/kubernetes/kubernetes/issues/61012 under sig-scheduling. > Spark 2.3.0 on kubernetes > - > > Key: SPARK-24105 > URL: https://issues.apache.org/jira/browse/SPARK-24105 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > Right now its only possible to define node selector configurations > thruspark.kubernetes.node.selector.[labelKey]. This gets used for both driver > & executor pods. Without the capability to isolate driver & executor pods, > the cluster can run into a livelock scenario, where if there are a lot of > spark submits, can cause the driver pods to fill up the cluster capacity, > with no room for executor pods to do any work. > > To avoid this deadlock, its required to support node selector (in future > affinity/anti-affinity) configruation by driver & executor. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamilselvan Veeramani updated SPARK-24106: -- Target Version/s: 2.3.0, 2.2.1 (was: 2.2.1, 2.3.0) Issue Type: Improvement (was: Bug) > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partition >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: performance > Fix For: 2.3.0, 2.4.0 > > > RandomForestClassificationModel broadcasted to executors for every mini batch > in spark streaming while try to find probability > RF model size 45MB > spark kafka streaming job jar size 8 MB (including kafka dependency’s) > following log show model broad cast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). > due to which task deserialization time also increases comes to 6 to 7 second > for 45MB of rf model, although processing time is just 400 to 600 ms for mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potentially solution which will > help many people who is looking to use RF model for “probability” in real > time streaming context > Since RandomForestClassificationModel class of transformImpl method > implements only “prediction” in current version of spark. Which can be > leveraged to implement “probability” also in RandomForestClassificationModel > class of transformImpl method. > I have modified the code and implemented in our server and it’s working as > fast as 400ms to 500ms for every mini batch > I see many people our there facing this issue and no solution provided in any > of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454788#comment-16454788 ] Jose Torres commented on SPARK-24036: - https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE I wrote a quick doc summarizing my thoughts. TLDR is: * I think it's better to not reuse the existing shuffle infrastructure - we'll have to do more work to get good performance later, but current shuffle has very bad characteristics for what continuous processing is trying to do. In particular I doubt we'd be able to maintain millisecond-scale latency with anything like UnsafeShuffleWriter. * It's a small diff on top of a working shuffle to support exactly-once state management. I don't think the coordinator needs to worry about stateful operators; a writer will never commit if a stateful operator below it fails to checkpoint, and the stateful operator itself can rewind if it commits an epoch that ends up failing. Let me know what you two think. I'll send this out to the dev list if it looks reasonable, and then we can start thinking about how this breaks down into individual tasks. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamilselvan Veeramani updated SPARK-24106: -- Target Version/s: 2.3.0, 2.2.1 (was: 2.2.1, 2.3.0) Component/s: (was: MLlib) ML > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partition >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: performance > Fix For: 2.3.0, 2.4.0 > > > RandomForestClassificationModel broadcasted to executors for every mini batch > in spark streaming while try to find probability > RF model size 45MB > spark kafka streaming job jar size 8 MB (including kafka dependency’s) > following log show model broad cast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). > due to which task deserialization time also increases comes to 6 to 7 second > for 45MB of rf model, although processing time is just 400 to 600 ms for mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potentially solution which will > help many people who is looking to use RF model for “probability” in real > time streaming context > Since RandomForestClassificationModel class of transformImpl method > implements only “prediction” in current version of spark. Which can be > leveraged to implement “probability” also in RandomForestClassificationModel > class of transformImpl method. > I have modified the code and implemented in our server and it’s working as > fast as 400ms to 500ms for every mini batch > I see many people our there facing this issue and no solution provided in any > of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454774#comment-16454774 ] Tamilselvan Veeramani commented on SPARK-24106: --- I have the code change ready and tested in our server. I can provide for your review. > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partition >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: performance > Fix For: 2.3.0, 2.4.0 > > > RandomForestClassificationModel broadcasted to executors for every mini batch > in spark streaming while try to find probability > RF model size 45MB > spark kafka streaming job jar size 8 MB (including kafka dependency’s) > following log show model broad cast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). > due to which task deserialization time also increases comes to 6 to 7 second > for 45MB of rf model, although processing time is just 400 to 600 ms for mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potentially solution which will > help many people who is looking to use RF model for “probability” in real > time streaming context > Since RandomForestClassificationModel class of transformImpl method > implements only “prediction” in current version of spark. Which can be > leveraged to implement “probability” also in RandomForestClassificationModel > class of transformImpl method. > I have modified the code and implemented in our server and it’s working as > fast as 400ms to 500ms for every mini batch > I see many people our there facing this issue and no solution provided in any > of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch
Tamilselvan Veeramani created SPARK-24106: - Summary: Spark Structure Streaming with RF model taking long time in processing probability for each mini batch Key: SPARK-24106 URL: https://issues.apache.org/jira/browse/SPARK-24106 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.3.0, 2.2.1, 2.2.0 Environment: Spark yarn / Standalone cluster 2 master nodes - 32 cores - 124 GB 9 worker nodes - 32 cores - 124 GB Kafka input and output topic with 6 partition Reporter: Tamilselvan Veeramani Fix For: 2.4.0, 2.3.0 RandomForestClassificationModel broadcasted to executors for every mini batch in spark streaming while try to find probability RF model size 45MB spark kafka streaming job jar size 8 MB (including kafka dependency’s) following log show model broad cast to executors for every mini batch when we call rf_model.transform(dataset).select("probability"). due to which task deserialization time also increases comes to 6 to 7 second for 45MB of rf model, although processing time is just 400 to 600 ms for mini batch 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) After 2 to 3 weeks of struggle, I found a potentially solution which will help many people who is looking to use RF model for “probability” in real time streaming context Since RandomForestClassificationModel class of transformImpl method implements only “prediction” in current version of spark. Which can be leveraged to implement “probability” also in RandomForestClassificationModel class of transformImpl method. I have modified the code and implemented in our server and it’s working as fast as 400ms to 500ms for every mini batch I see many people our there facing this issue and no solution provided in any of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
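For context, the scoring call the report is about is just {{transform()}} followed by selecting the {{probability}} column; in streaming it runs once per micro-batch. The Kafka plumbing is omitted here, and a tiny in-memory training set stands in for a real pre-trained model, so this sketch only illustrates the call pattern, not the per-batch broadcast cost:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("rf-probability-demo").getOrCreate()

# Tiny illustrative training set; a production job would load a saved model.
train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (0.0, Vectors.dense(0.1, 0.9)),
    (1.0, Vectors.dense(1.0, 0.1)),
    (1.0, Vectors.dense(0.9, 0.0)),
], ["label", "features"])

rf_model = RandomForestClassifier(numTrees=5).fit(train)

# The call pattern from the report: transform() and select "probability".
rf_model.transform(train).select("probability", "prediction").show(truncate=False)
{code}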
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454761#comment-16454761 ] Maxim Gekk commented on SPARK-24068: The same issue exists in JSON datasource. [~hyukjin.kwon] Do we need a separate ticket for that? > CSV schema inferring doesn't work for compressed files > -- > > Key: SPARK-24068 > URL: https://issues.apache.org/jira/browse/SPARK-24068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Here is a simple csv file compressed by lzo > {code} > $ cat ./test.csv > col1,col2 > a,1 > $ lzop ./test.csv > $ ls > test.csv test.csv.lzo > {code} > Reading test.csv.lzo with LZO codec (see > https://github.com/twitter/hadoop-lzo, for example): > {code:scala} > scala> val ds = spark.read.option("header", true).option("inferSchema", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") > ds: org.apache.spark.sql.DataFrame = [�LZO?: string] > scala> ds.printSchema > root > |-- �LZO: string (nullable = true) > scala> ds.show > +-+ > |�LZO| > +-+ > |a| > +-+ > {code} > but the file can be read if the schema is specified: > {code} > scala> import org.apache.spark.sql.types._ > scala> val schema = new StructType().add("col1", StringType).add("col2", > IntegerType) > scala> val ds = spark.read.schema(schema).option("header", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") > scala> ds.show > +++ > |col1|col2| > +++ > | a| 1| > +++ > {code} > Just in case, schema inferring works for the original uncompressed file: > {code:scala} > scala> spark.read.option("header", true).option("inferSchema", > true).csv("test.csv").printSchema > root > |-- col1: string (nullable = true) > |-- col2: integer (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
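For reference, the explicit-schema workaround from the report translated to PySpark; it assumes the hadoop-lzo jars are on the classpath and that the test.csv.lzo file from the reproduction exists at the given path.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", IntegerType()),
])

# Supplying the schema up front sidesteps the broken inference path
# on compressed input.
ds = (spark.read
      .schema(schema)
      .option("header", True)
      .option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
      .csv("/tmp/test.csv.lzo"))
ds.show()
{code}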
[jira] [Commented] (SPARK-23120) Add PMML pipeline export support to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454716#comment-16454716 ] Apache Spark commented on SPARK-23120: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/21172 > Add PMML pipeline export support to PySpark > --- > > Key: SPARK-23120 > URL: https://issues.apache.org/jira/browse/SPARK-23120 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > > Once we have Scala support go back and fill in Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23120) Add PMML pipeline export support to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23120: Assignee: holdenk (was: Apache Spark) > Add PMML pipeline export support to PySpark > --- > > Key: SPARK-23120 > URL: https://issues.apache.org/jira/browse/SPARK-23120 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > > Once we have Scala support go back and fill in Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23120) Add PMML pipeline export support to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23120: Assignee: Apache Spark (was: holdenk) > Add PMML pipeline export support to PySpark > --- > > Key: SPARK-23120 > URL: https://issues.apache.org/jira/browse/SPARK-23120 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Major > > Once we have Scala support go back and fill in Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24085) Scalar subquery error
[ https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454707#comment-16454707 ] Dilip Biswal edited comment on SPARK-24085 at 4/26/18 7:09 PM: --- Working on a fix for this. My current thinking is to not consider subquery expressions in the partition pruning process. was (Author: dkbiswal): Working on a fix for this. > Scalar subquery error > - > > Key: SPARK-24085 > URL: https://issues.apache.org/jira/browse/SPARK-24085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Alexey Baturin >Priority: Major > > Error > {noformat} > SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate > expression: scalar-subquery{noformat} > Then query a partitioed table based on a parquet file then filter by > partition column by scalar subquery. > Query to reproduce: > {code:sql} > CREATE TABLE test_prc_bug ( > id_value string > ) > partitioned by (id_type string) > location '/tmp/test_prc_bug' > stored as parquet; > insert into test_prc_bug values ('1','a'); > insert into test_prc_bug values ('2','a'); > insert into test_prc_bug values ('3','b'); > insert into test_prc_bug values ('4','b'); > select * from test_prc_bug > where id_type = (select 'b'); > {code} > If table in ORC format it works fine -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24085) Scalar subquery error
[ https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454707#comment-16454707 ] Dilip Biswal commented on SPARK-24085: -- Working on a fix for this. > Scalar subquery error > - > > Key: SPARK-24085 > URL: https://issues.apache.org/jira/browse/SPARK-24085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Alexey Baturin >Priority: Major > > Error > {noformat} > SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate > expression: scalar-subquery{noformat} > Then query a partitioed table based on a parquet file then filter by > partition column by scalar subquery. > Query to reproduce: > {code:sql} > CREATE TABLE test_prc_bug ( > id_value string > ) > partitioned by (id_type string) > location '/tmp/test_prc_bug' > stored as parquet; > insert into test_prc_bug values ('1','a'); > insert into test_prc_bug values ('2','a'); > insert into test_prc_bug values ('3','b'); > insert into test_prc_bug values ('4','b'); > select * from test_prc_bug > where id_type = (select 'b'); > {code} > If table in ORC format it works fine -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24105) Spark 2.3.0 on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lenin updated SPARK-24105: -- Description: Right now it's only possible to define node selector configurations thru spark.kubernetes.node.selector.[labelKey]. This gets used for both driver & executor pods. Without the capability to isolate driver & executor pods, the cluster can run into a livelock scenario, where a large number of spark submits can cause the driver pods to fill up the cluster capacity, with no room for executor pods to do any work. To avoid this livelock, it's required to support node selector (and in future affinity/anti-affinity) configuration per driver & executor. was: Right now it's only possible to define node selector configurations thru spark.kubernetes.node.selector.[labelKey]. This gets used for both driver & executor pods. Without the capability to isolate driver & executor pods, the cluster can run into a deadlock scenario, where a large number of spark submits can cause the driver pods to fill up the cluster capacity, with no room for executor pods to do any work. To avoid this deadlock, it's required to support node selector (and in future affinity/anti-affinity) configuration per driver & executor. > Spark 2.3.0 on kubernetes > - > > Key: SPARK-24105 > URL: https://issues.apache.org/jira/browse/SPARK-24105 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > Right now it's only possible to define node selector configurations > thru spark.kubernetes.node.selector.[labelKey]. This gets used for both driver > & executor pods. Without the capability to isolate driver & executor pods, > the cluster can run into a livelock scenario, where a large number of > spark submits can cause the driver pods to fill up the cluster capacity, > with no room for executor pods to do any work. > > To avoid this livelock, it's required to support node selector (and in future > affinity/anti-affinity) configuration per driver & executor. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
[ https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24104: Assignee: (was: Apache Spark) > SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of > updating them > - > > Key: SPARK-24104 > URL: https://issues.apache.org/jira/browse/SPARK-24104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Priority: Major > > SqlAppStatusListener does > {code} > exec.driverAccumUpdates = accumUpdates.toMap > update(exec) > {code} > in onDriverAccumUpdates. > But postDriverMetricUpdates is called multiple time per query, e.g. from each > FileSourceScanExec and BroadcastExchangeExec. > If the update does not really update it in the KV store (depending on > liveUpdatePeriodNs), the previously posted metrics are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
[ https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24104: Assignee: Apache Spark > SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of > updating them > - > > Key: SPARK-24104 > URL: https://issues.apache.org/jira/browse/SPARK-24104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Assignee: Apache Spark >Priority: Major > > SqlAppStatusListener does > {code} > exec.driverAccumUpdates = accumUpdates.toMap > update(exec) > {code} > in onDriverAccumUpdates. > But postDriverMetricUpdates is called multiple time per query, e.g. from each > FileSourceScanExec and BroadcastExchangeExec. > If the update does not really update it in the KV store (depending on > liveUpdatePeriodNs), the previously posted metrics are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
[ https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454630#comment-16454630 ] Apache Spark commented on SPARK-24104: -- User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/21171 > SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of > updating them > - > > Key: SPARK-24104 > URL: https://issues.apache.org/jira/browse/SPARK-24104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Priority: Major > > SqlAppStatusListener does > {code} > exec.driverAccumUpdates = accumUpdates.toMap > update(exec) > {code} > in onDriverAccumUpdates. > But postDriverMetricUpdates is called multiple time per query, e.g. from each > FileSourceScanExec and BroadcastExchangeExec. > If the update does not really update it in the KV store (depending on > liveUpdatePeriodNs), the previously posted metrics are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24105) Spark 2.3.0 on kubernetes
Lenin created SPARK-24105: - Summary: Spark 2.3.0 on kubernetes Key: SPARK-24105 URL: https://issues.apache.org/jira/browse/SPARK-24105 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Lenin Right now it's only possible to define node selector configurations thru spark.kubernetes.node.selector.[labelKey]. This gets used for both driver & executor pods. Without the capability to isolate driver & executor pods, the cluster can run into a deadlock scenario, where a large number of spark submits can cause the driver pods to fill up the cluster capacity, with no room for executor pods to do any work. To avoid this deadlock, it's required to support node selector (and in future affinity/anti-affinity) configuration per driver & executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
[ https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454594#comment-16454594 ] Dongjoon Hyun commented on SPARK-23962: --- [~irashid], [~cloud_fan], Please check the build status on `branch-2.3/sbt`. [https://github.com/apache/spark/pull/21041#issuecomment-384721509] Also, cc [~smilegator]. > Flaky tests from SQLMetricsTestUtils.currentExecutionIds > > > Key: SPARK-23962 > URL: https://issues.apache.org/jira/browse/SPARK-23962 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > Attachments: unit-tests.log > > > I've seen > {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin > metrics}} fail > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ > > with > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) > ... > {noformat} > I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is > racing with the listener bus. > I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23842) accessing java from PySpark lambda functions
[ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-23842. - Resolution: Won't Fix Not supported by the current design, alternatives do exist though. > accessing java from PySpark lambda functions > > > Key: SPARK-23842 > URL: https://issues.apache.org/jira/browse/SPARK-23842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: cloudpickle, py4j, pyspark > > Copied from https://github.com/bartdag/py4j/issues/311 but it seems to be > more of a Spark issue than py4j.. > |We have a third-party Java library that is distributed to Spark executors > through {{--jars}} parameter. > We want to call a static Java method in that library on executor's side > through Spark's {{map()}} or create an object of that library's class through > {{mapPartitions()}} call. > None of the approaches worked so far. It seems Spark tries to serialize > everything it sees in a lambda function, distribute to executors etc. > I am aware of an older py4j issue/question > [#171|https://github.com/bartdag/py4j/issues/171] but looking at that > discussion isn't helpful. > We thought to create a reference to that "class" through a call like > {{genmodel = spark._jvm.hex.genmodel}}and then operate through py4j to expose > functionality of that library in pyspark executors' lambda functions. > We don't want Spark to try to serialize spark session variables "spark" nor > its reference to py4j gateway {{spark._jvm}} (because it leads to expected > non-serializable exceptions), so tried to "trick" Spark not to try to > serialize those by nested the above {{genmodel = spark._jvm.hex.genmodel}} > into {{exec()}} call. > It led to another issue that {{spark}} (spark session) nor {{sc}} (spark > context) variables seems not available in spark executors' lambda functions. > So we're stuck and don't know how to call a generic java class through py4j > on executor's side (from within {{map}} or {{mapPartitions}} lambda > functions). > It would be an easier adventure from Scala/ Java for Spark as those can > directly call that 3rd-party libraries methods, but our users ask to have a > way to do the same from PySpark.| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions
[ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454585#comment-16454585 ] holdenk commented on SPARK-23842: - So the py4j gateway only exists on the driver program, on the worker programs Spark uses its own method of communicating between the worker and Spark. You might find looking at [https://github.com/sparklingpandas/sparklingml] to be helpful, if you factor your code so that the Java function takes in a DataFrame you can use it that way (or register Java UDFs as shown). > accessing java from PySpark lambda functions > > > Key: SPARK-23842 > URL: https://issues.apache.org/jira/browse/SPARK-23842 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: cloudpickle, py4j, pyspark > > Copied from https://github.com/bartdag/py4j/issues/311 but it seems to be > more of a Spark issue than py4j.. > |We have a third-party Java library that is distributed to Spark executors > through {{--jars}} parameter. > We want to call a static Java method in that library on executor's side > through Spark's {{map()}} or create an object of that library's class through > {{mapPartitions()}} call. > None of the approaches worked so far. It seems Spark tries to serialize > everything it sees in a lambda function, distribute to executors etc. > I am aware of an older py4j issue/question > [#171|https://github.com/bartdag/py4j/issues/171] but looking at that > discussion isn't helpful. > We thought to create a reference to that "class" through a call like > {{genmodel = spark._jvm.hex.genmodel}}and then operate through py4j to expose > functionality of that library in pyspark executors' lambda functions. > We don't want Spark to try to serialize spark session variables "spark" nor > its reference to py4j gateway {{spark._jvm}} (because it leads to expected > non-serializable exceptions), so tried to "trick" Spark not to try to > serialize those by nested the above {{genmodel = spark._jvm.hex.genmodel}} > into {{exec()}} call. > It led to another issue that {{spark}} (spark session) nor {{sc}} (spark > context) variables seems not available in spark executors' lambda functions. > So we're stuck and don't know how to call a generic java class through py4j > on executor's side (from within {{map}} or {{mapPartitions}} lambda > functions). > It would be an easier adventure from Scala/ Java for Spark as those can > directly call that 3rd-party libraries methods, but our users ask to have a > way to do the same from PySpark.| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
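One of the alternatives mentioned above, registering a Java UDF so the JVM code runs on the executors without touching py4j inside a lambda, looks roughly like this. The Java class name is a placeholder: it is assumed to implement org.apache.spark.sql.api.java.UDF1 and to be shipped with {{--jars}}.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical UDF class; only the registration pattern is the point here.
spark.udf.registerJavaFunction("gen_score", "com.example.hex.GenModelUDF", StringType())

df = spark.createDataFrame([("row1",), ("row2",)], ["id"])
df.selectExpr("gen_score(id) AS score").show()
{code}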
[jira] [Created] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
Juliusz Sompolski created SPARK-24104: - Summary: SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them Key: SPARK-24104 URL: https://issues.apache.org/jira/browse/SPARK-24104 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Juliusz Sompolski SqlAppStatusListener does {code} exec.driverAccumUpdates = accumUpdates.toMap update(exec) {code} in onDriverAccumUpdates. But postDriverMetricUpdates is called multiple time per query, e.g. from each FileSourceScanExec and BroadcastExchangeExec. If the update does not really update it in the KV store (depending on liveUpdatePeriodNs), the previously posted metrics are lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23925) High-order function: repeat(element, count) → array
[ https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454526#comment-16454526 ] Marek Novotny edited comment on SPARK-23925 at 4/26/18 4:47 PM: [~pepinoflo] Any joy? I can take this one or help if you want? was (Author: mn-mikke): @pepinoflo Any joy? I can take this one or help if you want? > High-order function: repeat(element, count) → array > --- > > Key: SPARK-23925 > URL: https://issues.apache.org/jira/browse/SPARK-23925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Repeat element for count times. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23925) High-order function: repeat(element, count) → array
[ https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454526#comment-16454526 ] Marek Novotny commented on SPARK-23925: --- @pepinoflo Any joy? I can take this one or help if you want? > High-order function: repeat(element, count) → array > --- > > Key: SPARK-23925 > URL: https://issues.apache.org/jira/browse/SPARK-23925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Repeat element for count times. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
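Until a built-in exists, the intended semantics of {{repeat(element, count)}} can be emulated with a plain Python UDF; this sketch is only meant to pin down the expected behaviour, not to suggest an implementation strategy for the built-in.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# repeat(element, count) -> array: build a count-element array of element.
repeat_udf = udf(lambda elem, n: [elem] * n, ArrayType(StringType()))

df = spark.createDataFrame([("ab", 3), ("cd", 2)], ["element", "count"])
df.select(repeat_udf(col("element"), col("count")).alias("repeated")).show(truncate=False)
{code}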
[jira] [Commented] (SPARK-22732) Add DataSourceV2 streaming APIs
[ https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454501#comment-16454501 ] Apache Spark commented on SPARK-22732: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21170 > Add DataSourceV2 streaming APIs > --- > > Key: SPARK-22732 > URL: https://issues.apache.org/jira/browse/SPARK-22732 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.3.0 > > > Structured Streaming APIs are currently tucked in a spark internal package. > We need to expose a new version in the DataSourceV2 framework, and add the > APIs required for continuous processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23715: Assignee: Apache Spark > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. 
FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts the long value by the local time zone's offset, > which produces a incorrect timestamp (except in the case where the input > dateti
[jira] [Commented] (SPARK-24096) create table as select not using hive.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-24096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454463#comment-16454463 ] Yuming Wang commented on SPARK-24096: - Another related PR: https://github.com/apache/spark/pull/14430 > create table as select not using hive.default.fileformat > > > Key: SPARK-24096 > URL: https://issues.apache.org/jira/browse/SPARK-24096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: StephenZou >Priority: Major > > In my spark conf directory, hive-site.xml has an item indicating ORC is the > default file format: > > hive.default.fileformat > orc > > > But when I use "create table as select ..." to create a table, the output > format is plain text. > It works only if I use "set hive.default.fileformat=orc". > > Then I walked through the Spark code and found in > sparkSqlParser:visitCreateHiveTable(), > val defaultStorage = HiveSerDe.getDefaultStorage(conf), the conf is SQLConf, > which explains the above observation: > "set hive.default.fileformat=orc" is put into the conf map, hive-site.xml is not. > > It's quite misleading. How can the settings be unified? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
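A sketch of the workaround the reporter describes, assuming an existing Hive-enabled session and the repro's table names: since visitCreateHiveTable reads the default storage from SQLConf, a session-level SET is honoured even though the same key in hive-site.xml is not.
{code:scala}
import org.apache.spark.sql.SparkSession

object CtasDefaultFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    spark.sql("SET hive.default.fileformat=orc")     // lands in the SQLConf map the parser consults
    spark.sql("CREATE TABLE t2 AS SELECT * FROM t1") // t2 is now written as ORC instead of plain text

    spark.stop()
  }
}
{code}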
[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454462#comment-16454462 ] Apache Spark commented on SPARK-23715: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21169 > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. 
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts the long value by the local time zone's of
[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23715: Assignee: (was: Apache Spark) > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. 
FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts the long value by the local time zone's offset, > which produces a incorrect timestamp (except in the case where the input > datetime string just happened t
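The description quoted above walks through two offset shifts; the sketch below reproduces that arithmetic with plain java.util.TimeZone calls, assuming an America/Los_Angeles (UTC-7 on that date) session zone as in the example. It is bookkeeping only, not the Spark implementation.
{code:scala}
import java.util.TimeZone

object FromUtcTimestampShiftSketch {
  def main(args: Array[String]): Unit = {
    val local      = TimeZone.getTimeZone("America/Los_Angeles")
    val utcInstant = 1520921903000L // 2018-03-13T06:18:23+00:00 in epoch milliseconds

    // stringToTimestamp treats the zone-less string as local wall-clock time, so
    // FromUTCTimestamp actually receives the instant 2018-03-13T06:18:23-07:00:
    val received = utcInstant - local.getOffset(utcInstant)

    // Step 1: reverse-shift by the local offset to recover the value as if it were UTC...
    val recovered = received + local.getOffset(received)

    // Step 2: ...then shift by the requested zone (GMT+1), which renders as 07:18:23.
    val shifted = recovered + TimeZone.getTimeZone("GMT+1").getOffset(recovered)

    println(s"received=$received recovered=$recovered shifted=$shifted")
  }
}
{code}
With an explicit time zone or a Unix-time input, the first shift never happened, so the reverse-shift in step 1 lands on the wrong instant, which is the mismatch the report describes.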
[jira] [Resolved] (SPARK-24096) create table as select not using hive.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-24096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-24096. - Resolution: Duplicate > create table as select not using hive.default.fileformat > > > Key: SPARK-24096 > URL: https://issues.apache.org/jira/browse/SPARK-24096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: StephenZou >Priority: Major > > In my spark conf directory, hive-site.xml have an item indicating orc is the > default file format. > > hive.default.fileformat > orc > > > But when I use "create table as select ..." to create a table, the output > format is plain text. > It works only I use "set hive.default.fileformat=orc" > > Then I walked through the spark code and found in > sparkSqlParser:visitCreateHiveTable(), > val defaultStorage = HiveSerDe.getDefaultStorage(conf) the conf is SQLConf, > that explains the above observation, > "set hive.default.fileformat=orc" is put into conf map, hive-site.xml is not. > > It's quite misleading, How to unify the settings? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454454#comment-16454454 ] Apache Spark commented on SPARK-20087: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/21165 > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis >Priority: Major > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454445#comment-16454445 ] Apache Spark commented on SPARK-24101: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/17086 > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
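Until the evaluator itself accepts a weight column, a weighted metric can be computed directly with DataFrame aggregations. The sketch below is a workaround, not part of the ticket, and the column names ("label", "prediction", "weight") are assumptions about the caller's schema.
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum, when}

object WeightedAccuracySketch {
  // Weighted accuracy = sum of weights on correctly predicted rows / total weight.
  def weightedAccuracy(predictions: DataFrame): Double =
    predictions
      .select(
        (sum(when(col("prediction") === col("label"), col("weight")).otherwise(lit(0.0))) /
          sum(col("weight"))).as("weightedAccuracy"))
      .first()
      .getDouble(0)
}
{code}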
[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24101: Assignee: (was: Apache Spark) > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24101: Assignee: Apache Spark > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Assignee: Apache Spark >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Matiach updated SPARK-24101: - Description: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights. This breaks model selection using CrossValidator. > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Matiach updated SPARK-24102: - Description: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights. This breaks model selection using CrossValidator. > RegressionEvaluator should use sample weight data > - > > Key: SPARK-24102 > URL: https://issues.apache.org/jira/browse/SPARK-24102 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Matiach updated SPARK-24103: - Description: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights. This breaks model selection using CrossValidator. > BinaryClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24103 > URL: https://issues.apache.org/jira/browse/SPARK-24103 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454435#comment-16454435 ] Ilya Matiach commented on SPARK-18693: -- [~josephkb] sure, I've added 3 JIRAs for tracking and have linked to them: https://issues.apache.org/jira/browse/SPARK-24103 https://issues.apache.org/jira/browse/SPARK-24101 https://issues.apache.org/jira/browse/SPARK-24102 Thank you, Ilya > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data
Ilya Matiach created SPARK-24103: Summary: BinaryClassificationEvaluator should use sample weight data Key: SPARK-24103 URL: https://issues.apache.org/jira/browse/SPARK-24103 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.0.2 Reporter: Ilya Matiach -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Matiach updated SPARK-24102: - Issue Type: Improvement (was: Bug) > RegressionEvaluator should use sample weight data > - > > Key: SPARK-24102 > URL: https://issues.apache.org/jira/browse/SPARK-24102 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Matiach updated SPARK-24101: - Issue Type: Improvement (was: Bug) > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-24101 > URL: https://issues.apache.org/jira/browse/SPARK-24101 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24102) RegressionEvaluator should use sample weight data
Ilya Matiach created SPARK-24102: Summary: RegressionEvaluator should use sample weight data Key: SPARK-24102 URL: https://issues.apache.org/jira/browse/SPARK-24102 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.2 Reporter: Ilya Matiach -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data
Ilya Matiach created SPARK-24101: Summary: MulticlassClassificationEvaluator should use sample weight data Key: SPARK-24101 URL: https://issues.apache.org/jira/browse/SPARK-24101 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.2 Reporter: Ilya Matiach -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454393#comment-16454393 ] Alex Wajda commented on SPARK-23933: Oh, now I got it, thanks. I overlooked the case when the keys could be themselves arrays. Looks like another function name is needed, e.g. to_map or zip_map. > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454365#comment-16454365 ] Eric Maynard commented on SPARK-23830: -- [~jerryshao] I agree we should not support using a {{class}}. However, I also believe that it's bad practice to throw a non-descriptive NPE. I created a PR here which adds more useful logging in the event that a proper main method isn't found and prevents throwing an NPE: https://github.com/apache/spark/pull/21168 > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
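A tiny standalone reproduction of the failure mode and of the kind of check the PR argues for: a Scala class (rather than an object) yields an instance main, so a reflective invoke with a null receiver throws an opaque NullPointerException unless the static modifier is tested first. The class and message wording here are illustrative, not Spark's actual ApplicationMaster code.
{code:scala}
import java.lang.reflect.Modifier

class MyClassLikeApp { // a `class`, so main is an instance method, as in the report
  def main(args: Array[String]): Unit = println("hello")
}

object MainMethodCheckSketch {
  def main(args: Array[String]): Unit = {
    val mainMethod = classOf[MyClassLikeApp].getMethod("main", classOf[Array[String]])
    if (Modifier.isStatic(mainMethod.getModifiers)) {
      mainMethod.invoke(null, Array.empty[String]) // safe only for a static main
    } else {
      // A descriptive error instead of the bare NPE from invoke(null, ...).
      println(s"${mainMethod.getDeclaringClass.getName}.main is not static; " +
        "declare the application entry point in an `object`.")
    }
  }
}
{code}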
[jira] [Assigned] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23830: Assignee: (was: Apache Spark) > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23830: Assignee: Apache Spark > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454358#comment-16454358 ] Apache Spark commented on SPARK-23830: -- User 'eric-maynard' has created a pull request for this issue: https://github.com/apache/spark/pull/21168 > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454278#comment-16454278 ] Tavis Barr commented on SPARK-13446: Enjoy! This is with Spark 2.3.0 and Hive 2.2.0 scala> spark2.sql("SHOW DATABASES").show java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79) at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) ... 51 elided > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
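Not from the ticket: a sketch of the documented configuration for talking to a newer Hive metastore from Spark 2.x while keeping Hive 2.x jars from shadowing Spark's built-in client (which is what produces the HIVE_STATS_JDBC_TIMEOUT NoSuchFieldError above). The version string and jar source are assumptions to be adapted to the target metastore.
{code:scala}
import org.apache.spark.sql.SparkSession

object NewerMetastoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-2-metastore-sketch")
      .config("spark.sql.hive.metastore.version", "2.0.0") // a metastore version Spark 2.x can load
      .config("spark.sql.hive.metastore.jars", "maven")    // or a classpath of matching Hive jars
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()
  }
}
{code}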
[jira] [Comment Edited] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454221#comment-16454221 ] Wenchen Fan edited comment on SPARK-23715 at 4/26/18 1:53 PM: -- It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL standard. Physically it stores microseconds from unix epoch, and when it's involved in timezone aware operations, like convert to string format, get the hour component, etc., Spark SQL uses session local timezone to interpret it. That said, the timestamp in Spark does not carry the timezone information, so `from_utc_timestamp` doesn't make sense in Spark to change the timezone of a timestamp. `from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we should think of a reasonable definition for it to make it work as users expect. What do users expect? {code} scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show +-+ |from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)| +-+ | 2018-01-01 08:00:00| +-+ scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4") scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show +-+ |from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)| +-+ | 2018-01-01 08:00:00| +-+ {code} The expected behavior is to shift the timestamp from UTC to the specified timezone, assuming the timestamp is timezone-agnostic. So the session local timezone should not affect the result. This also means the timestamp string can't carry the timezone and this is a bug in Spark. For integer input: {code} scala> java.util.TimeZone.getDefault.getID res27: String = Asia/Shanghai // This is GMT+8 scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show +---+ |from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)| +---+ |1970-01-01 16:00:00| +---+ scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4") scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show +---+ |from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)| +---+ |1970-01-01 12:00:00| +---+ {code} so what happened is, `cast(0 as timestamp)` assumes the input integer is seconds from unix epoch, and it converts the seconds to microseconds and done, no timezone information is needed. According to the semantic of Spark timestamp, `cast(0 as timestamp)` is effectively `1970-01-01 08:00:00`(local timezone is GMT+8) as an input to `from_utc_timestamp`, and that's why the result is `1970-01-01 16:00:00`. And the result changes if the local timezone changes. Since this behavior is consistent with Hive, let's stick with it. Personly I don't think it's the corrected behavior but it's too late to change, and we don't have to change as long as it's well defined, i.e. the result is deterministic. Now the only thing we need to change is, if input is string, return null if it contains timezone. was (Author: cloud_fan): It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL standard. Physically it stores microseconds from unix epoch, and when it's involved in timezone aware operations, like convert to string format, get the hour component, etc., Spark SQL uses session local timezone to interpret it. 
That said, the timestamp in Spark does not carry the timezone information, so `from_utc_timestamp` doesn't make sense in Spark to change the timezone of a timestamp. `from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we should think of a reasonable definition for it to make it work as users expect. What do users expect? {code} scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show +-+ |from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)| +-+ |
[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454221#comment-16454221 ] Wenchen Fan commented on SPARK-23715: - It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL standard. Physically it stores microseconds from unix epoch, and when it's involved in timezone aware operations, like convert to string format, get the hour component, etc., Spark SQL uses session local timezone to interpret it. That said, the timestamp in Spark does not carry the timezone information, so `from_utc_timestamp` doesn't make sense in Spark to change the timezone of a timestamp. `from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we should think of a reasonable definition for it to make it work as users expect. What do users expect? {code} scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show +-+ |from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)| +-+ | 2018-01-01 08:00:00| +-+ scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4") scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show +-+ |from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)| +-+ | 2018-01-01 08:00:00| +-+ {code} The expected behavior is to shift the timestamp from UTC to the specified timezone, assuming the timestamp is timezone-agnostic. So the session local timezone should not affect the result. This also means the timestamp string can't carry the timezone and this is a bug in Spark. For integer input, {code} scala> java.util.TimeZone.getDefault.getID res27: String = Asia/Shanghai // This is GMT+8 scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show +---+ |from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)| +---+ |1970-01-01 16:00:00| +---+ scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4") scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show +---+ |from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)| +---+ |1970-01-01 12:00:00| +---+ {code} so what happened is, `cast(0 as timestamp)` assumes the input integer is seconds from unix epoch, and it converts the seconds to microseconds and done, no timezone information is needed. According to the semantic of Spark timestamp, `cast(0 as timestamp)` is effectively `1970-01-01 08:00:00`(local timezone is GMT+8) as an input to `from_utc_timestamp`, and that's why the result is `1970-01-01 16:00:00`. And the result changes if the local timezone changes. Since this behavior is consistent with Hive, let's stick with it. Now the only thing we need to change is, if input is string, return null if it contains timezone. 
> from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > |
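A small sketch of the "timestamp with local time zone" point in the comment above: the stored long never changes, only its rendering under the session time zone does. It assumes a plain local session; the printed values depend on the zones chosen.
{code:scala}
import org.apache.spark.sql.SparkSession

object SessionTimeZoneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("tz-sketch").getOrCreate()

    spark.conf.set("spark.sql.session.timeZone", "GMT+8")
    spark.sql("SELECT cast(0 AS timestamp)").show() // 1970-01-01 08:00:00

    spark.conf.set("spark.sql.session.timeZone", "GMT+4")
    spark.sql("SELECT cast(0 AS timestamp)").show() // 1970-01-01 04:00:00

    spark.stop()
  }
}
{code}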
[jira] [Commented] (SPARK-21661) SparkSQL can't merge load table from Hadoop
[ https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454070#comment-16454070 ] Li Yuanjian commented on SPARK-21661: - Got it. > SparkSQL can't merge load table from Hadoop > --- > > Key: SPARK-21661 > URL: https://issues.apache.org/jira/browse/SPARK-21661 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dapeng Sun >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.0 > > > Here is the original listing of the external table on HDFS: > {noformat} > Permission Owner Group Size Last Modified Replication Block Size Name > -rw-r--r-- root supergroup 0 B 8/6/2017, 11:43:03 PM 3 256 MB income_band_001.dat > -rw-r--r-- root supergroup 0 B 8/6/2017, 11:39:31 PM 3 256 MB income_band_002.dat > ... > -rw-r--r-- root supergroup 327 B 8/6/2017, 11:44:47 PM 3 256 MB income_band_530.dat > {noformat} > After the SparkSQL load, every input file has a corresponding output file, even when the file is 0 B. For the same load on Hive, the data files would be merged according to the size of the original files. > Reproduce: > {noformat} > CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION > "hdfs://xxx:9000/data/t1" > CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1; > {noformat} > The table t2 has many small files without data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
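For anyone hitting this on an affected version, one way to sidestep the many empty output files is to repartition before writing instead of relying on the CTAS directly. A rough sketch, assuming the DataFrame API is acceptable and that 10 output files is a sensible target for the data volume (both assumptions, not part of the original report):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Collapse the per-input-file partitioning before writing, so small or empty
// input files don't each become their own Parquet output file.
spark.table("t1")
  .repartition(10) // assumed target number of output files
  .write
  .format("parquet")
  .saveAsTable("t2")
{code}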
[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453991#comment-16453991 ] Steve Loughran commented on SPARK-23151: Well, there's "have everything work on Hadoop 3" and "release it", so it could be kept open, or we could just make this an implicit aspect of SPARK-23534. > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is that using up-to-date Kinesis libraries alongside S3 causes a clash w.r.t. > aws-java-sdk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
[ https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453952#comment-16453952 ] Apache Spark commented on SPARK-24100: -- User 'WzRaCai' has created a pull request for this issue: https://github.com/apache/spark/pull/21167 > Add the CompressionCodec to the saveAsTextFiles interface. > -- > > Key: SPARK-24100 > URL: https://issues.apache.org/jira/browse/SPARK-24100 > Project: Spark > Issue Type: Improvement > Components: DStreams, PySpark >Affects Versions: 2.3.0 >Reporter: caijie >Priority: Minor > > Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
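Until such a parameter exists on saveAsTextFiles itself, one Scala-side workaround is to drop down to foreachRDD and use RDD.saveAsTextFile, which already accepts a compression codec. A rough sketch (the helper name and the path prefix are made up for illustration):
{code:scala}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.streaming.dstream.DStream

// Write each batch as gzip-compressed text files, mirroring the
// "prefix-<batchTimeMs>" directory naming used by saveAsTextFiles.
def saveAsCompressedTextFiles(lines: DStream[String], prefix: String): Unit = {
  lines.foreachRDD { (rdd, time) =>
    rdd.saveAsTextFile(s"$prefix-${time.milliseconds}", classOf[GzipCodec])
  }
}
{code}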
[jira] [Assigned] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
[ https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24100: Assignee: (was: Apache Spark) > Add the CompressionCodec to the saveAsTextFiles interface. > -- > > Key: SPARK-24100 > URL: https://issues.apache.org/jira/browse/SPARK-24100 > Project: Spark > Issue Type: Improvement > Components: DStreams, PySpark >Affects Versions: 2.3.0 >Reporter: caijie >Priority: Minor > > Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
[ https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24100: Assignee: Apache Spark > Add the CompressionCodec to the saveAsTextFiles interface. > -- > > Key: SPARK-24100 > URL: https://issues.apache.org/jira/browse/SPARK-24100 > Project: Spark > Issue Type: Improvement > Components: DStreams, PySpark >Affects Versions: 2.3.0 >Reporter: caijie >Assignee: Apache Spark >Priority: Minor > > Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
caijie created SPARK-24100: -- Summary: Add the CompressionCodec to the saveAsTextFiles interface. Key: SPARK-24100 URL: https://issues.apache.org/jira/browse/SPARK-24100 Project: Spark Issue Type: Improvement Components: DStreams, PySpark Affects Versions: 2.3.0 Reporter: caijie Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23856) Spark jdbc setQueryTimeout option
[ https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453948#comment-16453948 ] Takeshi Yamamuro commented on SPARK-23856: -- [~dmitrymikhailov] Do you have time to make a PR? If not, I'm happy to take it over. > Spark jdbc setQueryTimeout option > - > > Key: SPARK-23856 > URL: https://issues.apache.org/jira/browse/SPARK-23856 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dmitry Mikhailov >Priority: Minor > > It would be nice if a user could set the JDBC setQueryTimeout option when > running JDBC in Spark. I think it can be easily implemented by adding an option > field to the _JDBCOptions_ class and applying this option when initializing JDBC > statements in the _JDBCRDD_ class. But only some DB vendors support this JDBC > feature. Is it worth starting work on this option? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
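For context, the request is essentially to surface the plain JDBC Statement#setQueryTimeout call through the data source options. A standalone JDBC sketch of the underlying feature (the connection details are placeholders, and driver support varies by vendor):
{code:scala}
import java.sql.DriverManager

// Placeholder connection details, for illustration only.
val conn = DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", "user", "secret")
try {
  val stmt = conn.createStatement()
  stmt.setQueryTimeout(10) // seconds; the driver cancels the query if it runs longer
  val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
{code}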
[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name
[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453880#comment-16453880 ] Tr3wory commented on SPARK-23929: - Yes, the documentation is a must, even for 2.3 if possible (it's a new feature of 2.3, right?). > pandas_udf schema mapped by position and not by name > > > Key: SPARK-23929 > URL: https://issues.apache.org/jira/browse/SPARK-23929 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: PySpark > Spark 2.3.0 > >Reporter: Omri >Priority: Major > > The return struct of a pandas_udf should be mapped to the provided schema by > name. Currently it's not the case. > Consider these two examples, where the only change is the order of the fields > in the provided schema struct: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > and this one: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > The results should be the same but they are different: > For the first code: > {code:java} > +---+---+ > |  v| id| > +---+---+ > |1.0|  0| > |1.0|  0| > |2.0|  0| > |2.0|  0| > |2.0|  1| > +---+---+ > {code} > For the second code: > {code:java} > +---+-------------------+ > | id|                  v| > +---+-------------------+ > |  1|-0.7071067811865475| > |  1| 0.7071067811865475| > |  2|-0.8320502943378437| > |  2|-0.2773500981126146| > |  2| 1.1094003924504583| > +---+-------------------+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453861#comment-16453861 ] Peter Simon commented on SPARK-4781: As commented under SPARK-11748, a possible workaround can be: {code:java} scala> spark.sql("set parquet.column.index.access=true").show scala> spark.sql("set spark.sql.hive.convertMetastoreParquet=false").show scala> spark.sql("select * from test_parq").show +----+------+ |a_ch|     b| +----+------+ |   1|     a| |   2|test 2| +----+------+{code} > Column values become all NULL after doing ALTER TABLE CHANGE for renaming > column names (Parquet external table in HiveContext) > -- > > Key: SPARK-4781 > URL: https://issues.apache.org/jira/browse/SPARK-4781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.3.0 >Reporter: Jianshi Huang >Priority: Major > > I have a table, say, created as follows: > {code} > CREATE EXTERNAL TABLE pmt ( > `sorted::cre_ts` string > ) > STORED AS PARQUET > LOCATION '...' > {code} > And I renamed the column from sorted::cre_ts to cre_ts by doing: > {code} > ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string > {code} > After renaming the column, the values in the column become all NULLs. > {noformat} > Before renaming: > scala> sql("select `sorted::cre_ts` from pmt limit 1").collect > res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) > Execute renaming: > scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string") > res13: org.apache.spark.sql.SchemaRDD = > SchemaRDD[972] at RDD at SchemaRDD.scala:108 > == Query Plan == > > After renaming: > scala> sql("select cre_ts from pmt limit 1").collect > res16: Array[org.apache.spark.sql.Row] = Array([null]) > {noformat} > Jianshi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11748) Result is null after alter column name of table stored as Parquet
[ https://issues.apache.org/jira/browse/SPARK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453858#comment-16453858 ] Peter Simon commented on SPARK-11748: - A possible workaround can be: {code:java} scala> spark.sql("set parquet.column.index.access=true").show scala> spark.sql("set spark.sql.hive.convertMetastoreParquet=false").show scala> spark.sql("select * from test_parq").show +----+------+ |a_ch|     b| +----+------+ |   1|     a| |   2|test 2| +----+------+ {code} > Result is null after alter column name of table stored as Parquet > -- > > Key: SPARK-11748 > URL: https://issues.apache.org/jira/browse/SPARK-11748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: pin_zhang >Priority: Major > > 1. Test with the following code > hctx.sql(" create table " + table + " (id int, str string) STORED AS > PARQUET ") > val df = hctx.jsonFile("g:/vip.json") > df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table) > hctx.sql(" select * from " + table).show() > // alter table > val alter = "alter table " + table + " CHANGE id i_d int " > hctx.sql(alter) > > hctx.sql(" select * from " + table).show() > 2. Result > After changing the table column name, data is null for the changed column. > Result before alter table > +---+---+ > | id|str| > +---+---+ > |  1| s1| > |  2| s2| > +---+---+ > Result after alter table > +----+---+ > | i_d|str| > +----+---+ > |null| s1| > |null| s2| > +----+---+ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org