[jira] [Commented] (SPARK-28360) The serviceAccountName configuration item does not take effect in client mode.
[ https://issues.apache.org/jira/browse/SPARK-28360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376901#comment-17376901 ] Mathew Wicks commented on SPARK-28360: -- [~holden] this issue is still present. That is, the docs still imply that `spark.kubernetes.authenticate.serviceAccountName` will set the serviceAccountName in client mode. You can [see the faulty docs|https://spark.apache.org/docs/3.1.2/running-on-kubernetes.html] under the "Meaning" column of the `spark.kubernetes.authenticate.driver.serviceAccountName` config of the Spark 3.1.2 docs. {quote}Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file, client cert file, and/or OAuth token. In client mode, use spark.kubernetes.authenticate.serviceAccountName instead. {quote} > The serviceAccountName configuration item does not take effect in client mode. > -- > > Key: SPARK-28360 > URL: https://issues.apache.org/jira/browse/SPARK-28360 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: zhixingheyi_tian >Priority: Major > > From the configuration item description in the Spark documentation: > https://spark.apache.org/docs/latest/running-on-kubernetes.html > > “spark.kubernetes.authenticate.driver.serviceAccountName default Service > account that is used when running the driver pod. The driver pod uses this > service account when requesting executor pods from the API server. Note that > this cannot be specified alongside a CA cert file, client key file, client > cert file, and/or OAuth token. In client mode, use > spark.kubernetes.authenticate.serviceAccountName instead.” > But in client mode, “spark.kubernetes.authenticate.serviceAccountName” does > not actually take effect. 
> From an analysis of the source code, Spark never reads the configuration item > "spark.kubernetes.authenticate.serviceAccountName". > The unit tests only cover > "spark.kubernetes.authenticate.driver.serviceAccountName". > In Kubernetes, a service account provides an identity for processes that run > in a Pod. When you create a pod, if you do not specify a service account, it > is automatically assigned the default service account in the same namespace. > Setting "spec.serviceAccountName" when creating a pod specifies a custom > service account. > So in client mode, if you run your driver inside a Kubernetes pod, the > service account already exists. If your application is not running inside > a pod, no service account is needed at all. > From my point of view, we should just modify the document and delete the > "spark.kubernetes.authenticate.serviceAccountName" configuration item > description: it does not work at the moment, and it also does not need to > work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
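To make the point above concrete: in client mode the driver's identity against the Kubernetes API comes from the pod spec, not from any `spark.kubernetes.*` setting. A minimal sketch of a client-mode driver pod (all names, images, and the command are illustrative, not taken from the issue):

```yaml
# Hypothetical client-mode driver pod. The identity used when requesting
# executor pods is determined here, by spec.serviceAccountName -- if this
# field is omitted, the "default" service account of the namespace is used.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver          # illustrative name
  namespace: mynamespace
spec:
  serviceAccountName: spark   # the custom service account, e.g. one bound to an RBAC role
  containers:
    - name: driver
      image: my-spark-image:latest   # illustrative image
      command: ["spark-submit", "--master", "k8s://https://kubernetes.default.svc",
                "--deploy-mode", "client", "..."]
```

This is why the reporter argues the config key is unnecessary in client mode: the service account is a property of the pod running the driver, not of the Spark job.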
[jira] [Commented] (SPARK-32226) JDBC TimeStamp predicates always append `.0`
[ https://issues.apache.org/jira/browse/SPARK-32226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156413#comment-17156413 ] Mathew Wicks commented on SPARK-32226: -- [~Chen Zhang], thanks for the idea! However, while `DATETIME YEAR TO SECOND` doesn't support the `.0` suffix, `DATETIME YEAR TO FRACTION` does, so we would need to add some logic to detect the type of the source column. By the way, is adding new dialects something that is accepted into core Spark these days? > JDBC TimeStamp predicates always append `.0` > > > Key: SPARK-32226 > URL: https://issues.apache.org/jira/browse/SPARK-32226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mathew Wicks >Priority: Major > > If you have an Informix column with type `DATETIME YEAR TO SECOND`, Informix > will not let you pass a filter of the form `2020-01-01 00:00:00.0` (with the > `.0` at the end). > > In Spark 3.0.0, our predicate pushdown will always append this `.0` to the end > of a TimeStamp column filter, even if you don't specify it: > {code:java} > df.where("col1 > '2020-01-01 00:00:00'") > {code} > > I think we should only pass the `.XXX` suffix if the user passes it in the > filter, for example: > {code:java} > df.where("col1 > '2020-01-01 00:00:00.123'") > {code} > > The relevant Spark class is: > {code:java} > org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString > {code} > > To aid people searching for this error, here is the error emitted by Spark: > {code:java} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2141) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2093) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2133) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) > at org.apache.spark.sql.Dataset.show(Dataset.scala:824) > at org.apache.spark.sql.Dataset.show(Dataset.scala:783) > at org.apache.spark.sql.Dataset.show(Dataset.scala:792) > ... 47 elided > Caused by: java.sql.SQLException: Extra characters at the end of a datetime > or interval. > at com.i
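The formatting rule proposed in the issue (emit a fractional suffix only when the timestamp actually has a fractional part) can be sketched in Python. This is an illustration of the rule, not Spark's actual `DateTimeUtils.timestampToString` implementation:

```python
from datetime import datetime

def timestamp_to_string(ts: datetime) -> str:
    """Format a timestamp for a pushed-down JDBC predicate.

    Spark 3.0.0 always appends ".0" when the fraction is zero; the
    behaviour proposed in the issue is to omit the suffix entirely
    unless the timestamp carries a fractional part.
    """
    base = ts.strftime("%Y-%m-%d %H:%M:%S")
    if ts.microsecond == 0:
        return base  # no ".0" suffix -- acceptable to DATETIME YEAR TO SECOND
    # Trim trailing zeros from the fractional part, keeping at least one digit.
    frac = f"{ts.microsecond:06d}".rstrip("0")
    return f"{base}.{frac}"

print(timestamp_to_string(datetime(2020, 1, 1)))                    # 2020-01-01 00:00:00
print(timestamp_to_string(datetime(2020, 1, 1, 0, 0, 0, 123000)))  # 2020-01-01 00:00:00.123
```

A dialect-aware fix, as discussed in the comment, would additionally need to know whether the target column is `DATETIME YEAR TO SECOND` or `DATETIME YEAR TO FRACTION` before deciding to emit any suffix at all.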
[jira] [Created] (SPARK-32226) JDBC TimeStamp predicates always append `.0`
Mathew Wicks created SPARK-32226: Summary: JDBC TimeStamp predicates always append `.0` Key: SPARK-32226 URL: https://issues.apache.org/jira/browse/SPARK-32226 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Mathew Wicks If you have an Informix column with type `DATETIME YEAR TO SECOND`, Informix will not let you pass a filter of the form `2020-01-01 00:00:00.0` (with the `.0` at the end). In Spark 3.0.0, our predicate pushdown will always append this `.0` to the end of a TimeStamp column filter, even if you don't specify it: {code:java} df.where("col1 > '2020-01-01 00:00:00'") {code} I think we should only pass the `.XXX` suffix if the user passes it in the filter, for example: {code:java} df.where("col1 > '2020-01-01 00:00:00.123'") {code} The relevant Spark class is: {code:java} org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString {code} To aid people searching for this error, here is the error emitted by Spark: {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2141) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2093) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2133) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at org.apache.spark.sql.Dataset.show(Dataset.scala:824) at org.apache.spark.sql.Dataset.show(Dataset.scala:783) at org.apache.spark.sql.Dataset.show(Dataset.scala:792) ... 
47 elided Caused by: java.sql.SQLException: Extra characters at the end of a datetime or interval. at com.informix.util.IfxErrMsg.buildExceptionWithMessage(IfxErrMsg.java:416) at com.informix.util.IfxErrMsg.buildIsamException(IfxErrMsg.java:401) at com.informix.jdbc.IfxSqli.addException(IfxSqli.java:3096) at com.informix.jdbc.IfxSqli.receiveError(IfxSqli.java:3368) at com.informix.jdbc.IfxSqli.dispatchMsg(IfxSqli.java:2292) at com.informix.jdbc.IfxSqli.receiveMessage(IfxSqli.java:2217) at com.informix.jdbc.IfxSqli.executePrepare(IfxSqli.java:1213) at com.informix.jdbc.IfxPreparedStatement.setupExecutePrepare(IfxPreparedStatement.java:245) at com.informix.jdbc.IfxPreparedStatement.processSQL(IfxPreparedStatement.java:229) at com.informix.jdbc.IfxPreparedStatement.(IfxPreparedStatement.java:119) at
[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024838#comment-17024838 ] Mathew Wicks commented on SPARK-26295: -- I am still encountering this issue on 2.4.4 (and given SPARK-28360, this issue likely also occurs in Spark 3.0's current preview, but I haven't verified this). Can anyone take a look at this [~dongjoon]? The issue is effectively that `spark.kubernetes.authenticate.driver.serviceAccountName` and `spark.kubernetes.authenticate.serviceAccountName` are ignored in client mode with a K8S master. No matter what you specify, the default service account of the `spark.kubernetes.namespace` namespace is used. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely added in cluster mode only, which would be consistent with > {{spark.kubernetes.authenticate.driver}} being the cluster mode prefix. > We should either inject the service account specified by this property in the > client mode pods, or specify an equivalent config: > {{spark.kubernetes.authenticate.serviceAccountName}} > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > The expectation was to see the user {{mynamespace:spark}} based on my submit > command. 
> My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.
[jira] [Commented] (SPARK-28360) The serviceAccountName configuration item does not take effect in client mode.
[ https://issues.apache.org/jira/browse/SPARK-28360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024835#comment-17024835 ] Mathew Wicks commented on SPARK-28360: -- This issue was also reported for 2.4, but has not been fixed. > The serviceAccountName configuration item does not take effect in client mode. > -- > > Key: SPARK-28360 > URL: https://issues.apache.org/jira/browse/SPARK-28360 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: zhixingheyi_tian >Priority: Major > > From the configuration item description in the Spark documentation: > https://spark.apache.org/docs/latest/running-on-kubernetes.html > > “spark.kubernetes.authenticate.driver.serviceAccountName default Service > account that is used when running the driver pod. The driver pod uses this > service account when requesting executor pods from the API server. Note that > this cannot be specified alongside a CA cert file, client key file, client > cert file, and/or OAuth token. In client mode, use > spark.kubernetes.authenticate.serviceAccountName instead.” > But in client mode, “spark.kubernetes.authenticate.serviceAccountName” does > not actually take effect. > From an analysis of the source code, Spark never reads the configuration item > "spark.kubernetes.authenticate.serviceAccountName". > The unit tests only cover > "spark.kubernetes.authenticate.driver.serviceAccountName". > In Kubernetes, a service account provides an identity for processes that run > in a Pod. When you create a pod, if you do not specify a service account, it > is automatically assigned the default service account in the same namespace. > Setting "spec.serviceAccountName" when creating a pod specifies a custom > service account. > So in client mode, if you run your driver inside a Kubernetes pod, the > service account already exists. If your application is not running inside > a pod, no service account is needed at all. 
> From my point of view, we should just modify the document and delete the > "spark.kubernetes.authenticate.serviceAccountName" configuration item > description: it does not work at the moment, and it also does not need to > work.
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023439#comment-17023439 ] Mathew Wicks commented on SPARK-28921: -- [~dongjoon], it's simply bad practice not to update jars that depend on each other, so I never tried updating only one. However, I also remember other threads about this issue where people said they encountered errors when updating only one. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE: > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022625#comment-17022625 ] Mathew Wicks commented on SPARK-28921: -- It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*.jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar with the versions specified in this commit: https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE: > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.
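The jar replacement described in the comment above can be scripted. This is a hedged sketch: the function name, the directory holding the replacement jars, and the exact jar versions are assumptions (the versions should come from the linked apache/spark commit):

```shell
# Sketch: swap the Kubernetes-related jars in an existing Spark install.
# The second argument is assumed to hold the replacement jars, built with
# the versions pinned in apache/spark commit 65c0a78.
swap_spark_k8s_jars() {
  local spark_home="$1" new_jars_dir="$2"
  # Remove every jar the kubernetes-client upgrade touches -- replacing only
  # kubernetes-client-*.jar leaves mismatched okhttp/okio/model jars behind.
  for pattern in 'kubernetes-client-*.jar' 'kubernetes-model-*.jar' \
                 'kubernetes-model-common-*.jar' 'okhttp-*.jar' 'okio-*.jar'; do
    # $pattern is intentionally unquoted so the glob expands;
    # rm -f ignores patterns that match nothing.
    rm -f "$spark_home"/jars/$pattern
  done
  # Copy the replacement jars into place.
  cp "$new_jars_dir"/*.jar "$spark_home/jars/"
}
```

Usage would be something like `swap_spark_k8s_jars "$SPARK_HOME" /path/to/new-jars`; unrelated jars in `$SPARK_HOME/jars` are left untouched.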
[jira] [Comment Edited] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022625#comment-17022625 ] Mathew Wicks edited comment on SPARK-28921 at 1/24/20 1:03 AM: --- It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*.jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar with the versions specified in this commit: [https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18] was (Author: thesuperzapper): It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; you must also replace: * $SPARK_HOME/jars/kubernetes-client-*.jar * $SPARK_HOME/jars/kubernetes-model-common-*.jar * $SPARK_HOME/jars/kubernetes-model-*.jar * $SPARK_HOME/jars/okhttp-*.jar * $SPARK_HOME/jars/okio-*.jar with the versions specified in this commit: https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE: > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.
[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888495#comment-16888495 ] Mathew Wicks commented on SPARK-24632: -- I have an elegant solution for this: you can include a separate Python package which mirrors the class address of the Java objects you wrap. For example, in the PySpark API for XGBoost I created the following package for objects under *ml.dmlc.xgboost4j.scala.spark._* {code:java} ml/__init__.py ml/dmlc/__init__.py ml/dmlc/xgboost4j/__init__.py ml/dmlc/xgboost4j/scala/__init__.py ml/dmlc/xgboost4j/scala/spark/__init__.py {code} All __init__.py files are empty except the final one, which contains: {code:java} import sys from sparkxgb import xgboost # Allows Pipeline()/PipelineModel() with XGBoost stages to be loaded from disk. # Needed because they try to import Python objects from their Java location. sys.modules['ml.dmlc.xgboost4j.scala.spark'] = xgboost {code} My actual Python wrapper classes live under *sparkxgb.xgboost*. This works because PySpark will try to import from the Java address of the class, even though the wrapper is Python. For more context, see [the initial PR here|https://github.com/dmlc/xgboost/pull/4656]. > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. 
This is currently > possible, except that users hit bugs around persistence. > I spent a bit of time thinking about this and wrote up thoughts and a proposal in the > doc linked below. Summary of proposal: > Require that 3rd-party libraries with Java classes with Python wrappers > implement a trait which provides the corresponding Python classpath in some > field: > {code} > trait PythonWrappable { > def pythonClassPath: String = … > } > MyJavaType extends PythonWrappable > {code} > This will not be required for MLlib wrappers, which we can handle specially. > One issue for this task will be that we may have trouble writing unit tests. > They would ideally test a Java class + Python wrapper class pair sitting > outside of pyspark.
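The `sys.modules` aliasing trick from the comment above can be demonstrated without Spark or XGBoost at all: registering a module object under the Java-side dotted path lets `from <java.package.path> import X` resolve to the Python wrapper. The wrapper class and module names below are illustrative stand-ins, not the real sparkxgb code:

```python
import sys
import types

# Stand-in for the real wrapper module (plays the role of sparkxgb.xgboost).
wrappers = types.ModuleType("sparkxgb.xgboost")

class XGBoostClassifier:  # hypothetical Python wrapper class
    pass

wrappers.XGBoostClassifier = XGBoostClassifier

# Register the wrapper module under the JVM-side package path. The empty
# parent modules play the role of the empty __init__.py files in the real
# package, so the import machinery can resolve every level of the chain.
java_path = "ml.dmlc.xgboost4j.scala.spark"
parts = java_path.split(".")
for i in range(1, len(parts)):
    name = ".".join(parts[:i])
    sys.modules.setdefault(name, types.ModuleType(name))
sys.modules[java_path] = wrappers

# Persistence code that imports by the Java class path now gets the wrapper:
from ml.dmlc.xgboost4j.scala.spark import XGBoostClassifier as Loaded
print(Loaded is XGBoostClassifier)  # True
```

This is exactly why PipelineModel loading works: the loader only knows the Java class path recorded in the pipeline metadata, and the alias makes that path importable from Python.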
[jira] [Created] (SPARK-28032) DataFrame.saveAsTable in AVRO format with Timestamps creates bad Hive tables
Mathew Wicks created SPARK-28032: Summary: DataFrame.saveAsTable in AVRO format with Timestamps creates bad Hive tables Key: SPARK-28032 URL: https://issues.apache.org/jira/browse/SPARK-28032 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: Spark 2.4.3 Hive 1.1.0 Reporter: Mathew Wicks I am not sure if it's my very old version of Hive (1.1.0), but when I use the following code, I end up with a table which Spark can read, but Hive cannot. That is to say, when writing AVRO format tables, they cannot be read in Hive if they contain timestamp types. *Hive error:* {code:java} Error while compiling statement: FAILED: UnsupportedOperationException timestamp is not supported. {code} *Spark Code:* {code:java} import java.sql.Timestamp import spark.implicits._ val currentTime = new Timestamp(System.currentTimeMillis()) val df = Seq( (currentTime) ).toDF() df.write.mode("overwrite").format("avro").saveAsTable("database.table_name") {code}
[jira] [Commented] (SPARK-28008) Default values & column comments in AVRO schema converters
[ https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862699#comment-16862699 ] Mathew Wicks commented on SPARK-28008: -- The only issue I can think of is that the column comments wouldn't be saved (which some users might want). While I agree the API doesn't seem like it should be public, it is useful to know what schema a dataframe will be written with (some Spark types have to be converted for Avro). Also, the user might want to make changes and then use the "avroSchema" writer option, for example, writing timestamps as the "timestamp-millis" type rather than "timestamp-micros". Beyond that, is there really harm in having a more correct conversion from a StructType into an Avro Schema? > Default values & column comments in AVRO schema converters > -- > > Key: SPARK-28008 > URL: https://issues.apache.org/jira/browse/SPARK-28008 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Mathew Wicks >Priority: Major > > Currently in both `toAvroType` and `toSqlType` > [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] > there are two behaviours which are unexpected. > h2. 
Nullable fields in spark are converted to UNION[TYPE, NULL] and no > default value is set: > *Current Behaviour:* > {code:java} > import org.apache.spark.sql.avro.SchemaConverters > import org.apache.spark.sql.types._ > val schema = new StructType().add("a", "string", nullable = true) > val avroSchema = SchemaConverters.toAvroType(schema) > println(avroSchema.toString(true)) > { > "type" : "record", > "name" : "topLevelRecord", > "fields" : [ { > "name" : "a", > "type" : [ "string", "null" ] > } ] > } > {code} > *Expected Behaviour:* > (NOTE: The reversal of "null" & "string" in the union, needed for a default > value of null) > {code:java} > import org.apache.spark.sql.avro.SchemaConverters > import org.apache.spark.sql.types._ > val schema = new StructType().add("a", "string", nullable = true) > val avroSchema = SchemaConverters.toAvroType(schema) > println(avroSchema.toString(true)) > { > "type" : "record", > "name" : "topLevelRecord", > "fields" : [ { > "name" : "a", > "type" : [ "null", "string" ], > "default" : null > } ] > }{code} > h2. 
Field comments/metadata is not propagated: > *Current Behaviour:* > {code:java} > import org.apache.spark.sql.avro.SchemaConverters > import org.apache.spark.sql.types._ > val schema = new StructType().add("a", "string", nullable=false, > comment="AAA") > val avroSchema = SchemaConverters.toAvroType(schema) > println(avroSchema.toString(true)) > { > "type" : "record", > "name" : "topLevelRecord", > "fields" : [ { > "name" : "a", > "type" : "string" > } ] > }{code} > *Expected Behaviour:* > {code:java} > import org.apache.spark.sql.avro.SchemaConverters > import org.apache.spark.sql.types._ > val schema = new StructType().add("a", "string", nullable=false, > comment="AAA") > val avroSchema = SchemaConverters.toAvroType(schema) > println(avroSchema.toString(true)) > { > "type" : "record", > "name" : "topLevelRecord", > "fields" : [ { > "name" : "a", > "type" : "string", > "doc" : "AAA" > } ] > }{code} > > The behaviour should be similar (but the reverse) for `toSqlType`. > I think we should aim to get this in before 3.0, as it will probably be a > breaking change for some usage of the AVRO API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28008) Default values & column comments in AVRO schema converters
Mathew Wicks created SPARK-28008: Summary: Default values & column comments in AVRO schema converters Key: SPARK-28008 URL: https://issues.apache.org/jira/browse/SPARK-28008 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Mathew Wicks Currently in both `toAvroType` and `toSqlType` [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] there are two behaviours which are unexpected. h2. Nullable fields in spark are converted to UNION[TYPE, NULL] and no default value is set: *Current Behaviour:* {code:java} import org.apache.spark.sql.avro.SchemaConverters import org.apache.spark.sql.types._ val schema = new StructType().add("a", "string", nullable = true) val avroSchema = SchemaConverters.toAvroType(schema) println(avroSchema.toString(true)) { "type" : "record", "name" : "topLevelRecord", "fields" : [ { "name" : "a", "type" : [ "string", "null" ] } ] } {code} *Expected Behaviour:* (NOTE: The reversal of "null" & "string" in the union, needed for a default value of null) {code:java} import org.apache.spark.sql.avro.SchemaConverters import org.apache.spark.sql.types._ val schema = new StructType().add("a", "string", nullable = true) val avroSchema = SchemaConverters.toAvroType(schema) println(avroSchema.toString(true)) { "type" : "record", "name" : "topLevelRecord", "fields" : [ { "name" : "a", "type" : [ "null", "string" ], "default" : null } ] }{code} h2. 
Field comments/metadata is not propagated: *Current Behaviour:* {code:java} import org.apache.spark.sql.avro.SchemaConverters import org.apache.spark.sql.types._ val schema = new StructType().add("a", "string", nullable=false, comment="AAA") val avroSchema = SchemaConverters.toAvroType(schema) println(avroSchema.toString(true)) { "type" : "record", "name" : "topLevelRecord", "fields" : [ { "name" : "a", "type" : "string" } ] }{code} *Expected Behaviour:* {code:java} import org.apache.spark.sql.avro.SchemaConverters import org.apache.spark.sql.types._ val schema = new StructType().add("a", "string", nullable=false, comment="AAA") val avroSchema = SchemaConverters.toAvroType(schema) println(avroSchema.toString(true)) { "type" : "record", "name" : "topLevelRecord", "fields" : [ { "name" : "a", "type" : "string", "doc" : "AAA" } ] }{code} The behaviour should be similar (but the reverse) for `toSqlType`. I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
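The union-ordering change requested above hinges on an Avro rule: a field's default value must match the first branch of its union, so a nullable field with a null default has to be written as ["null", "string"] rather than ["string", "null"]. A minimal sketch of that reordering, using a hypothetical helper rather than the actual SchemaConverters code:

```python
# Sketch only: Avro validates a field's "default" against the FIRST branch
# of its union, so a nullable field needs "null" first for "default": null
# to be legal.
def null_first(union):
    """Reorder a union (list of Avro type names) so "null" comes first."""
    if "null" in union:
        return ["null"] + [t for t in union if t != "null"]
    return union
```

Applied to the schema in the report, `null_first(["string", "null"])` yields the union the "Expected Behaviour" block shows.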
[jira] [Comment Edited] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847192#comment-16847192 ] Mathew Wicks edited comment on SPARK-17477 at 5/24/19 2:52 AM: --- *UPDATE:* Sorry, I was mistaken: this is still an issue in Spark 2.4. It only seems to occur when "spark.sql.parquet.writeLegacyFormat=false"; when I set "spark.sql.parquet.writeLegacyFormat=true" the issue goes away (for Hive 1.1.0 and Spark 2.4.3). was (Author: thesuperzapper): This only seems to be an issue when "spark.sql.parquet.writeLegacyFormat=false"; when I set "spark.sql.parquet.writeLegacyFormat=true" the issue goes away (for Hive 1.1.0 and Spark 2.4.3). > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu >Priority: Major > > When using SparkSession to read a Hive table that is stored as parquet > files, a column may have undergone a schema evolution from int to long: > some old parquet files use int for the column while some new parquet files > use long, and in the Hive metastore the type is long (bigint). 
> Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt
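The ClassCastException at the bottom of the trace comes from storing a file-level 32-bit int into a row slot typed for the metastore's bigint. A hedged sketch of the widening dispatch a fix would need, with illustrative names rather than Spark's actual ParquetRowConverter API:

```python
# Hypothetical sketch: when the file's physical type is narrower than the
# table's declared type, widen the value instead of failing with a
# ClassCastException-style error. Only value-preserving upcasts are allowed.
SAFE_WIDENINGS = {("int", "bigint"), ("int", "double"), ("float", "double")}

def read_value(file_value, file_type, table_type):
    """Return file_value coerced to the table's declared type, or raise."""
    if file_type == table_type:
        return file_value
    if (file_type, table_type) in SAFE_WIDENINGS:
        # Widening preserves the value; in the JVM reader this would be
        # an Int -> Long (or Float -> Double) conversion.
        return file_value
    raise TypeError(f"cannot read {file_type} column as {table_type}")
```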
[jira] [Issue Comment Deleted] (SPARK-16544) Support for conversion from compatible schema for Parquet data source when data types are not matched
[ https://issues.apache.org/jira/browse/SPARK-16544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathew Wicks updated SPARK-16544: - Comment: was deleted (was: This only seems to be an issue if "spark.sql.parquet.writeLegacyFormat=false" when I set "spark.sql.parquet.writeLegacyFormat=true" this issue goes away. (For Hive 1.1.0 and Spark 2.4.3)) > Support for conversion from compatible schema for Parquet data source when > data types are not matched > - > > Key: SPARK-16544 > URL: https://issues.apache.org/jira/browse/SPARK-16544 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > This deals with scenario 1 - case - 1 from the parent issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847192#comment-16847192 ] Mathew Wicks commented on SPARK-17477: -- This only seems to be an issue when "spark.sql.parquet.writeLegacyFormat=false"; when I set "spark.sql.parquet.writeLegacyFormat=true" the issue goes away (for Hive 1.1.0 and Spark 2.4.3). > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu >Priority: Major > > When using SparkSession to read a Hive table that is stored as parquet > files, a column may have undergone a schema evolution from int to long: > some old parquet files use int for the column while some new parquet files > use long, and in the Hive metastore the type is long (bigint). 
> Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.rea
[jira] [Commented] (SPARK-16544) Support for conversion from compatible schema for Parquet data source when data types are not matched
[ https://issues.apache.org/jira/browse/SPARK-16544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847193#comment-16847193 ] Mathew Wicks commented on SPARK-16544: -- This only seems to be an issue when "spark.sql.parquet.writeLegacyFormat=false"; when I set "spark.sql.parquet.writeLegacyFormat=true" the issue goes away (for Hive 1.1.0 and Spark 2.4.3). > Support for conversion from compatible schema for Parquet data source when > data types are not matched > - > > Key: SPARK-16544 > URL: https://issues.apache.org/jira/browse/SPARK-16544 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > This deals with scenario 1 - case - 1 from the parent issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26388) No support for "alter table .. replace columns" to drop columns
[ https://issues.apache.org/jira/browse/SPARK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846528#comment-16846528 ] Mathew Wicks commented on SPARK-26388: -- After a bit of investigating, it seems like the HiveExternalCatalog API has most of the needed functionality already, so just the SQL part needs to be implemented. If there were a table named "database_name.table_name", you could overwrite its schema with this: {code:scala} import org.apache.spark.sql.types.StructType val schema = new StructType() .add("a", "string", nullable = true) .add("b", "string", nullable = true) .add("c", "string", nullable = true) spark.sharedState.externalCatalog.alterTableDataSchema("database_name", "table_name", schema) {code} Here is a link to the alterTableDataSchema() method: [https://github.com/apache/spark/blob/branch-2.4/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L647] > No support for "alter table .. replace columns" to drop columns > --- > > Key: SPARK-26388 > URL: https://issues.apache.org/jira/browse/SPARK-26388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1, 2.3.2 >Reporter: nirav patel >Priority: Major > > Looks like Hive {{replace columns}} is not working with Spark 2.2.1 and 2.3.1 > > create table myschema.mytable(a int, b int, c int) > alter table myschema.mytable replace columns (a int,b int,d int) > > *Expected Behavior* > It should drop column c and add column d. > alter table ... replace columns ... should work just as it does in Hive: > it replaces the existing columns with the new ones, deleting any column that > is not mentioned. 
> > Here's the snippet of the Hive CLI: > hive> desc mytable; > OK > a int > b int > c int > Time taken: 0.05 seconds, Fetched: 3 row(s) > hive> alter table mytable replace columns(a int, b int, d int); > OK > Time taken: 0.078 seconds > hive> desc mytable; > OK > a int > b int > d int > Time taken: 0.03 seconds, Fetched: 3 row(s) > > *Actual Result* > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: > alter table replace columns > {{ADD COLUMNS}} works, which seems to have been previously reported and fixed as well: > https://issues.apache.org/jira/browse/SPARK-18893 > > Replace columns should be supported as well; AFAIK, that's the only way to > delete Hive columns. > > > It is supposed to work according to these docs: > > [https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#replace-columns] > > [https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#supported-hive-features] > > but it throws an error for me on 2 different versions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
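For reference, the REPLACE COLUMNS semantics described in the report (the new column list wholly replaces the old one, so unmentioned columns are dropped and new ones added) can be sketched as a simple diff. This is illustrative only, not Spark or Hive internals:

```python
# Sketch of Hive's REPLACE COLUMNS semantics: given the old and new column
# lists, compute which columns are dropped (present before, unmentioned now)
# and which are added (mentioned now, absent before).
def replace_columns_diff(old_cols, new_cols):
    dropped = [c for c in old_cols if c not in new_cols]
    added = [c for c in new_cols if c not in old_cols]
    return dropped, added
```

With the report's example, replacing (a, b, c) by (a, b, d) drops c and adds d, matching the Hive CLI transcript above.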
[jira] [Commented] (SPARK-26388) No support for "alter table .. replace columns" to drop columns
[ https://issues.apache.org/jira/browse/SPARK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846490#comment-16846490 ] Mathew Wicks commented on SPARK-26388: -- This test suite seems to imply this feature is not supported: [https://github.com/apache/spark/blob/branch-2.4/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala#L460] > No support for "alter table .. replace columns" to drop columns > --- > > Key: SPARK-26388 > URL: https://issues.apache.org/jira/browse/SPARK-26388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1, 2.3.2 >Reporter: nirav patel >Priority: Major > > Looks like Hive {{replace columns}} is not working with Spark 2.2.1 and 2.3.1 > > create table myschema.mytable(a int, b int, c int) > alter table myschema.mytable replace columns (a int,b int,d int) > > *Expected Behavior* > It should drop column c and add column d. > alter table ... replace columns ... should work just as it does in Hive: > it replaces the existing columns with the new ones, deleting any column that > is not mentioned. > > Here's the snippet of the Hive CLI: > hive> desc mytable; > OK > a int > b int > c int > Time taken: 0.05 seconds, Fetched: 3 row(s) > hive> alter table mytable replace columns(a int, b int, d int); > OK > Time taken: 0.078 seconds > hive> desc mytable; > OK > a int > b int > d int > Time taken: 0.03 seconds, Fetched: 3 row(s) > > *Actual Result* > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: > alter table replace columns > {{ADD COLUMNS}} works, which seems to have been previously reported and fixed as well: > https://issues.apache.org/jira/browse/SPARK-18893 > > Replace columns should be supported as well; AFAIK, that's the only way to > delete Hive columns. 
> > > It is supposed to work according to these docs: > > [https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#replace-columns] > > [https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#supported-hive-features] > > but it throws an error for me on 2 different versions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26388) No support for "alter table .. replace columns" to drop columns
[ https://issues.apache.org/jira/browse/SPARK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846469#comment-16846469 ] Mathew Wicks commented on SPARK-26388: -- This appears to still be an issue in Spark 2.4.3. For all queries involving "ALTER TABLE table_name REPLACE COLUMNS (col_name STRING, ...)" you get: {code:java} Operation not allowed: ALTER TABLE REPLACE COLUMNS(line 1, pos 0){code} At the very least, we need to highlight this in the docs, as we currently say we support all Hive ALTER TABLE commands here: [https://spark.apache.org/docs/2.4.0/sql-migration-guide-hive-compatibility.html#supported-hive-features] > No support for "alter table .. replace columns" to drop columns > --- > > Key: SPARK-26388 > URL: https://issues.apache.org/jira/browse/SPARK-26388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1, 2.3.2 >Reporter: nirav patel >Priority: Major > > Looks like Hive {{replace columns}} is not working with Spark 2.2.1 and 2.3.1 > > create table myschema.mytable(a int, b int, c int) > alter table myschema.mytable replace columns (a int,b int,d int) > > *Expected Behavior* > It should drop column c and add column d. > alter table ... replace columns ... should work just as it does in Hive: > it replaces the existing columns with the new ones, deleting any column that > is not mentioned. 
> > Here's the snippet of the Hive CLI: > hive> desc mytable; > OK > a int > b int > c int > Time taken: 0.05 seconds, Fetched: 3 row(s) > hive> alter table mytable replace columns(a int, b int, d int); > OK > Time taken: 0.078 seconds > hive> desc mytable; > OK > a int > b int > d int > Time taken: 0.03 seconds, Fetched: 3 row(s) > > *Actual Result* > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: > alter table replace columns > {{ADD COLUMNS}} works, which seems to have been previously reported and fixed as well: > https://issues.apache.org/jira/browse/SPARK-18893 > > Replace columns should be supported as well; AFAIK, that's the only way to > delete Hive columns. > > > It is supposed to work according to these docs: > > [https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#replace-columns] > > [https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#supported-hive-features] > > but it throws an error for me on 2 different versions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21423) MODE average aggregate function.
Mathew Wicks created SPARK-21423: Summary: MODE average aggregate function. Key: SPARK-21423 URL: https://issues.apache.org/jira/browse/SPARK-21423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Mathew Wicks Priority: Minor Having a MODE() aggregate function which returns the mode average of a group/window would be very useful. For example, if the column type is a number, it finds the most common number, and if the column type is a string, it finds the most common string. I appreciate that doing this in a scalable way will require some thinking/discussion. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
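For intuition, the single-node semantics of such a MODE() aggregate look like the sketch below. A scalable Spark implementation would additionally need a partial-aggregation design (per-partition counts merged into a final maximum), which this sketch deliberately ignores:

```python
from collections import Counter

# Naive single-node mode: count values, keep a most-frequent one. Works for
# numbers and strings alike, as the issue asks. Ties are broken arbitrarily
# here; a real aggregate function would have to define (or at least
# document) its tie-breaking behaviour.
def mode(values):
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]
```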
[jira] [Created] (SPARK-20353) Implement Tensorflow TFRecords file format
Mathew Wicks created SPARK-20353: Summary: Implement TensorFlow TFRecords file format Key: SPARK-20353 URL: https://issues.apache.org/jira/browse/SPARK-20353 Project: Spark Issue Type: Improvement Components: Input/Output, SQL Affects Versions: 2.1.0 Reporter: Mathew Wicks Spark is a very good preprocessing engine for tools like TensorFlow. However, we lack native support for TensorFlow's core file format, TFRecords. There is a project which implements this functionality as an external JAR (but it is not user-friendly or robust enough for production use): https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector Here is some discussion around the above: https://github.com/tensorflow/ecosystem/issues/32 If we were to implement "tfrecords" as a DataFrame writable/readable format, we would have to account for the various data types that can be present in Spark columns, and which ones are actually useful in TensorFlow. Note: the `spark-tensorflow-connector` described above does not properly support the vector data type. Further discussion of whether this is within the scope of Spark SQL is strongly welcomed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20207) Add ability to exclude current row in WindowSpec
[ https://issues.apache.org/jira/browse/SPARK-20207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956403#comment-15956403 ] Mathew Wicks commented on SPARK-20207: -- A toy example is given in the Stack Overflow post. An alternative solution would be to implement array concatenation: for most aggregations you can split the calculation into the 'before current row' and 'after current row' partitions (think SUM()), but for functions like COLLECT_LIST() this is not possible. There is precedent for array concatenation in SQL, for example ARRAY_CONCAT() in BigQuery or ARRAY_CAT() in PostgreSQL. http://www.w3resource.com/PostgreSQL/postgresql_array_cat-function.php > Add ability to exclude current row in WindowSpec > --- > > Key: SPARK-20207 > URL: https://issues.apache.org/jira/browse/SPARK-20207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mathew Wicks >Priority: Minor > > It would be useful if we could implement a way to exclude the current row in > WindowSpec. (We can currently only select ranges of rows/time.) > Currently, users have to resort to ridiculous measures to exclude the current > row from windowing aggregations. > As seen here: > http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
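The before/after split described in the comment above, concatenating the 'before current row' partition with the 'after current row' partition, can be sketched for COLLECT_LIST on a plain sequence. This is illustrative only, not a Spark API:

```python
# For each position i in the partition, concatenate the slice before the
# current row with the slice after it. This is exactly the split that works
# for SUM() but requires array concatenation for COLLECT_LIST().
def collect_list_excluding_current(partition):
    return [partition[:i] + partition[i + 1:] for i in range(len(partition))]
```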
[jira] [Commented] (SPARK-20207) Add ability to exclude current row in WindowSpec
[ https://issues.apache.org/jira/browse/SPARK-20207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956372#comment-15956372 ] Mathew Wicks commented on SPARK-20207: -- Well, technically no, but it would be a good place to implement this functionality. Where would you suggest implementing it? > Add ability to exclude current row in WindowSpec > --- > > Key: SPARK-20207 > URL: https://issues.apache.org/jira/browse/SPARK-20207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mathew Wicks >Priority: Minor > > It would be useful if we could implement a way to exclude the current row in > WindowSpec. (We can currently only select ranges of rows/time.) > Currently, users have to resort to ridiculous measures to exclude the current > row from windowing aggregations. > As seen here: > http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20207) Add ability to exclude current row in WindowSpec
[ https://issues.apache.org/jira/browse/SPARK-20207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956363#comment-15956363 ] Mathew Wicks commented on SPARK-20207: -- Sorry, I was a bit unclear: we would want to define a window over every row in the partition except the current row. > Add ability to exclude current row in WindowSpec > --- > > Key: SPARK-20207 > URL: https://issues.apache.org/jira/browse/SPARK-20207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mathew Wicks >Priority: Minor > > It would be useful if we could implement a way to exclude the current row in > WindowSpec. (We can currently only select ranges of rows/time.) > Currently, users have to resort to ridiculous measures to exclude the current > row from windowing aggregations. > As seen here: > http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20207) Add ability to exclude current row in WindowSpec
Mathew Wicks created SPARK-20207: Summary: Add ability to exclude current row in WindowSpec Key: SPARK-20207 URL: https://issues.apache.org/jira/browse/SPARK-20207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Mathew Wicks Priority: Minor It would be useful if we could implement a way to exclude the current row in WindowSpec. (We can currently only select ranges of rows/time.) Currently, users have to resort to ridiculous measures to exclude the current row from windowing aggregations, as seen here: http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
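As a side note on cost: for invertible aggregates like SUM, excluding the current row reduces to the whole-partition total minus the row's own value, which is one reason the feature is cheap for some functions and hard for others like COLLECT_LIST. An illustrative sketch, not a Spark API:

```python
# SUM over "all rows in the partition except the current one": compute the
# partition total once, then subtract each row's own value. This one-pass
# trick only works because SUM is invertible.
def sum_excluding_current(partition):
    total = sum(partition)
    return [total - v for v in partition]
```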