[jira] [Assigned] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16240: Assignee: (was: Apache Spark) > model loading backward compatibility for ml.clustering.LDA > -- > > Key: SPARK-16240 > URL: https://issues.apache.org/jira/browse/SPARK-16240 > Project: Spark > Issue Type: Bug >Reporter: yuhao yang > > After resolving the matrix conversion issue, the LDA model still cannot load 1.6 > models because one of the parameter names was changed. > https://github.com/apache/spark/pull/12065 > We can perhaps add some special logic in the loading code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368963#comment-15368963 ] Apache Spark commented on SPARK-16240: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/14112
[jira] [Assigned] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16240: Assignee: Apache Spark
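The "special logic in the loading code" suggested above could take the shape of a small rename table applied to 1.6-era metadata before params are set. This is a hypothetical sketch only: "oldParamName" and "newParamName" are placeholders, not the actual LDA parameter that was renamed.

```scala
// Hypothetical sketch of backward-compatible param loading. The names
// "oldParamName" / "newParamName" are illustrative placeholders, not the
// actual parameter renamed between Spark 1.6 and 2.0.
object ParamCompat {
  // Renames applied when reading metadata written by an older Spark version.
  val renames: Map[String, String] = Map("oldParamName" -> "newParamName")

  // Rewrite loaded param keys so old metadata matches the current schema;
  // unknown keys pass through untouched.
  def migrate(params: Map[String, String]): Map[String, String] =
    params.map { case (k, v) => (renames.getOrElse(k, k), v) }
}

val migrated = ParamCompat.migrate(Map("oldParamName" -> "10", "k" -> "5"))
println(migrated)
```

The loader would run `migrate` on the parsed metadata map before applying each key as a param, so old and new models go through the same code path afterwards.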
[jira] [Commented] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368926#comment-15368926 ] Apache Spark commented on SPARK-16456: -- User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/14111 > Reuse the uncorrelated scalar subqueries with the same logical plan in a query > -- > > Key: SPARK-16456 > URL: https://issues.apache.org/jira/browse/SPARK-16456 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a > CTE could be executed multiple times; we should reuse the same result to > avoid the duplicated computation.
[jira] [Assigned] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16456: Assignee: Apache Spark
[jira] [Assigned] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16456: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-16446: --- Assignee: Yanbo Liang > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368916#comment-15368916 ] Yanbo Liang commented on SPARK-16446: - Sure.
[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16456: - Summary: Reuse the uncorrelated scalar subqueries with the same logical plan in a query (was: Reuse the uncorrelated scalar subqueries with with the same logical plan in a query)
[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query
[ https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-16456: - Description: In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times, we should re-use the same result to avoid the duplicated computing. (was: In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times, we should re-use the same result to avoid the duplicated computing on the same uncorrelated scalar subquery.)
[jira] [Created] (SPARK-16456) Reuse the uncorrelated scalar subqueries with with the same logical plan in a query
Lianhui Wang created SPARK-16456: Summary: Reuse the uncorrelated scalar subqueries with with the same logical plan in a query Key: SPARK-16456 URL: https://issues.apache.org/jira/browse/SPARK-16456 Project: Spark Issue Type: Improvement Components: SQL Reporter: Lianhui Wang In TPCDS Q14, the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times, we should re-use the same result to avoid the duplicated computing on the same uncorrelated scalar subquery.
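The reuse proposed for SPARK-16456 boils down to memoization keyed by a canonical form of the subquery plan: distinct occurrences of the same uncorrelated scalar subquery should map to one evaluation. This is a conceptual sketch only, not Spark's actual optimizer code; the string key stands in for a canonicalized logical plan.

```scala
import scala.collection.mutable

// Conceptual sketch of the proposed subquery reuse (not Spark internals):
// uncorrelated scalar subqueries are keyed by a canonical form of their
// plan, and each distinct plan is evaluated at most once.
class SubqueryCache {
  private val results = mutable.Map.empty[String, Any]
  var evaluations = 0 // how many times we actually ran a subquery

  def getOrExecute(canonicalPlan: String)(execute: => Any): Any =
    results.getOrElseUpdate(canonicalPlan, { evaluations += 1; execute })
}

val cache = new SubqueryCache
// The same subquery plan appears twice (e.g. via a CTE in TPCDS Q14)...
val r1 = cache.getOrExecute("SELECT avg(x) FROM t")(42)
val r2 = cache.getOrExecute("SELECT avg(x) FROM t")(42)
// ...but the second lookup returns the cached result without re-executing.
println((r1, r2, cache.evaluations))
```

In a real optimizer the key would be the canonicalized plan object rather than SQL text, so syntactically different but semantically identical subqueries also collapse to one evaluation.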
[jira] [Resolved] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions
[ https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11857. - Resolution: Fixed Assignee: Michael Gummelt (was: Reynold Xin) Fix Version/s: 2.0.0 > Remove Mesos fine-grained mode subject to discussions > - > > Key: SPARK-11857 > URL: https://issues.apache.org/jira/browse/SPARK-11857 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Reporter: Reynold Xin >Assignee: Michael Gummelt > Fix For: 2.0.0 > > > See discussions in > http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html > and > http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html
[jira] [Resolved] (SPARK-16432) Empty blocks fail to serialize due to assert in ChunkedByteBuffer
[ https://issues.apache.org/jira/browse/SPARK-16432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16432. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Empty blocks fail to serialize due to assert in ChunkedByteBuffer > - > > Key: SPARK-16432 > URL: https://issues.apache.org/jira/browse/SPARK-16432 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.0.0 > > > See https://github.com/apache/spark/pull/11748#issuecomment-230760283
[jira] [Resolved] (SPARK-16376) [Spark web UI]:HTTP ERROR 500 when using rest api "/applications/[app-id]/jobs" if array "stageIds" is empty
[ https://issues.apache.org/jira/browse/SPARK-16376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16376. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 > [Spark web UI]:HTTP ERROR 500 when using rest api > "/applications/[app-id]/jobs" if array "stageIds" is empty > > > Key: SPARK-16376 > URL: https://issues.apache.org/jira/browse/SPARK-16376 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > [Spark web UI]:HTTP ERROR 500 when using rest api > "/applications/[app-id]/jobs" if array "stageIds" is empty > See attachment for reference. > HTTP ERROR 500 > Problem accessing /api/v1/applications/application_1466239933301_175531/jobs. > Reason: > Server Error > Caused by: > java.lang.UnsupportedOperationException: empty.max > at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216) > at scala.collection.AbstractTraversable.max(Traversable.scala:105) > at > org.apache.spark.status.api.v1.AllJobsResource$.convertJobData(AllJobsResource.scala:71) > at > org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2$$anonfun$apply$2.apply(AllJobsResource.scala:46) > at > org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2$$anonfun$apply$2.apply(AllJobsResource.scala:44) > at > scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721) > at > org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2.apply(AllJobsResource.scala:44) > at > org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2.apply(AllJobsResource.scala:43) > at > scala.collection.TraversableLike$WithFilter$$anonfun$flatMap$2.apply(TraversableLike.scala:753) > at scala.collection.immutable.List.foreach(List.scala:318) > at > 
scala.collection.TraversableLike$WithFilter.flatMap(TraversableLike.scala:752) > at > org.apache.spark.status.api.v1.AllJobsResource.jobsList(AllJobsResource.scala:43) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.SubLocatorRule.accept(SubLocatorRule.java:134) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) > at > org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) > at > org.spark-project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496) > at > org.apache.hadoop.yarn.server.webproxy.amfi
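The root cause in the trace above is `TraversableOnce.max` throwing `UnsupportedOperationException: empty.max` when `stageIds` is empty. A minimal sketch of the defensive pattern, assuming a sentinel default of -1 for "no stages" (the actual fix in Spark may differ):

```scala
// Scala's `max` throws on an empty collection, which is what produces the
// "empty.max" UnsupportedOperationException behind the HTTP 500 above.
val stageIds: Seq[Int] = Seq.empty

// Throws java.lang.UnsupportedOperationException: empty.max
// stageIds.max

// Defensive alternative: reduceOption yields None instead of throwing,
// and a default (-1 here, an assumed sentinel) stands in for "no stages".
val maxStage: Option[Int] = stageIds.reduceOption(_ max _)
val maxOrDefault: Int = maxStage.getOrElse(-1)
println(maxOrDefault)
```

The same guard works for `min`, `maxBy`, and friends, all of which throw on empty input.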
[jira] [Resolved] (SPARK-13569) Kafka DStreams from wildcard topic filters
[ https://issues.apache.org/jira/browse/SPARK-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-13569. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 14026 [https://github.com/apache/spark/pull/14026] > Kafka DStreams from wildcard topic filters > -- > > Key: SPARK-13569 > URL: https://issues.apache.org/jira/browse/SPARK-13569 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Miguel Miranda > Fix For: 2.0.0 > > > Expose Kafka's ConsumerConnector createMessageStreamsByFilter functionality > so that wildcards (whitelists and blacklists) can be used. > Impacts KafkaUtil's createStream interface so that something other than a > list of maps can be passed as an argument.
[jira] [Updated] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YangyangLiu updated SPARK-16455: Description: In our case, we are implementing a new mechanism which will let driver survive when cluster is temporarily down and restarting. So when the service provided by cluster is not available, scheduler should stop scheduling new tasks. I added a hook inside CoarseGrainedSchedulerBackend class, in order to avoid new task scheduling when it's necessary. (was: In our case, we are implementing restartable resource manager mechanism which will let master executor survive when resource manager service is temporarily down. So when resource manager service is not available, scheduler should stop scheduling new tasks. I added a hook inside CoarseGrainedSchedulerBackend class, in order to avoid new task scheduling when it's necessary.) Summary: Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting (was: Add a new hook in makeOffers in CoarseGrainedSchedulerBackend)
[jira] [Assigned] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16455: Assignee: Apache Spark
[jira] [Assigned] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16455: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368744#comment-15368744 ] Apache Spark commented on SPARK-16455: -- User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14110
[jira] [Updated] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YangyangLiu updated SPARK-16455: Description: In our case, we are implementing restartable resource manager mechanism which will let master executor survive when resource manager service is temporarily down. So when resource manager service is not available, scheduler should stop scheduling new tasks. I added a hook inside CoarseGrainedSchedulerBackend class, in order to avoid new task scheduling when it's necessary.
[jira] [Updated] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YangyangLiu updated SPARK-16455: Labels: (was: newbie)
[jira] [Updated] (SPARK-16424) Add support for Structured Streaming to the ML Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-16424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16424: Description: For Spark 2.1 we should consider adding support for machine learning on top of the structured streaming API. Early work in progress design outline: https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing was:For Spark 2.1 we should consider adding support for machine learning on top of the structured streaming API.
[jira] [Created] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
YangyangLiu created SPARK-16455: --- Summary: Add a new hook in makeOffers in CoarseGrainedSchedulerBackend Key: SPARK-16455 URL: https://issues.apache.org/jira/browse/SPARK-16455 Project: Spark Issue Type: New Feature Components: Scheduler Reporter: YangyangLiu Priority: Minor
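The hook proposed in SPARK-16455 amounts to a predicate consulted before resource offers are made, so no new tasks launch while the cluster is restarting. A hypothetical sketch with illustrative names (this is not Spark's actual CoarseGrainedSchedulerBackend API, and the PR may wire the hook differently):

```scala
// Hypothetical sketch of the proposed pre-scheduling hook. The class and
// method names are illustrative, not Spark's real scheduler backend API.
class SchedulerBackendStub(isClusterAvailable: () => Boolean) {
  val scheduled = scala.collection.mutable.Buffer.empty[String]

  def makeOffers(task: String): Unit = {
    // The hook: skip scheduling entirely while the cluster is unavailable.
    if (isClusterAvailable()) scheduled += task
  }
}

var available = false
val backend = new SchedulerBackendStub(() => available)
backend.makeOffers("task-1") // dropped: cluster restarting
available = true
backend.makeOffers("task-2") // scheduled normally
println(backend.scheduled)
```

A real implementation would also need to decide what happens to the dropped offers (retry later vs. rely on the next scheduling round), which the sketch leaves out.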
[jira] [Created] (SPARK-16454) Consider adding a per-batch transform for structured streaming
holdenk created SPARK-16454: --- Summary: Consider adding a per-batch transform for structured streaming Key: SPARK-16454 URL: https://issues.apache.org/jira/browse/SPARK-16454 Project: Spark Issue Type: Improvement Components: SQL, Streaming Reporter: holdenk The new structured streaming API lacks the DStream functionality of transform (which allowed one to mix in existing RDD transformation logic). It would be useful to be able to do per-batch processing (even without any specific guarantees about the batch being complete, provided you eventually get called with the "catch up" records) as was done in the DStream API. This might be useful for implementing Streaming Machine Learning on Structured Streaming.
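To make the request concrete, here is a conceptual model of DStream-style `transform`: an arbitrary batch-level function applied to each micro-batch. This is a pure-Scala sketch over an in-memory sequence of batches, not a Structured Streaming API (none existed for this at the time of the issue):

```scala
// Conceptual sketch only: models DStream-style `transform`, where each
// micro-batch is handed to arbitrary batch-level logic. The in-memory
// Seq[Seq[A]] stands in for a stream of micro-batches.
def transformEachBatch[A, B](batches: Seq[Seq[A]])(f: Seq[A] => Seq[B]): Seq[Seq[B]] =
  batches.map(f)

// Example: per-batch de-duplication, a batch-level operation that is
// natural with whole-batch access but awkward row-at-a-time.
val stream = Seq(Seq(1, 1, 2), Seq(2, 3, 3))
val deduped = transformEachBatch(stream)(batch => batch.distinct)
println(deduped)
```

The point of the issue is that `f` can reuse existing batch (RDD) logic unchanged, which row-at-a-time streaming operators cannot express.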
[jira] [Resolved] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16387. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Reserved SQL words are not escaped by JDBC writer > - > > Key: SPARK-16387 > URL: https://issues.apache.org/jira/browse/SPARK-16387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Lev >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > Here is a code (imports are omitted) > object Main extends App { > val sqlSession = SparkSession.builder().config(new SparkConf(). > setAppName("Sql Test").set("spark.app.id", "SQLTest"). > set("spark.master", "local[2]"). > set("spark.ui.enabled", "false") > .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" )) > ).getOrCreate() > import sqlSession.implicits._ > val localprops = new Properties > localprops.put("user", "") > localprops.put("password", "") > val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order") > val writer = df.write > .mode(SaveMode.Append) > writer > .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops) > } > End error is : > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'order TEXT )' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > Clearly the reserved word has to be quoted
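The fix the reporter asks for is standard identifier quoting when generating DDL: MySQL quotes identifiers with backticks, so a reserved word like `order` becomes a legal column name. A minimal sketch of the idea (not Spark's actual JDBC dialect code; the helper names are illustrative):

```scala
// Minimal sketch of identifier quoting for generated DDL (not Spark's
// actual JdbcDialect code). MySQL uses backticks; the ANSI default is
// double quotes. Embedded quote characters are doubled to stay safe.
def quoteIdentifier(name: String, quote: String = "`"): String =
  quote + name.replace(quote, quote + quote) + quote

def createTableDdl(table: String, cols: Seq[(String, String)]): String = {
  val schema = cols.map { case (n, t) => s"${quoteIdentifier(n)} $t" }.mkString(", ")
  s"CREATE TABLE ${quoteIdentifier(table)} ($schema)"
}

// With quoting, the reserved word `order` from the bug report is legal DDL.
println(createTableDdl("jira_test", Seq("order" -> "TEXT")))
```

The unquoted form `CREATE TABLE jira_test (order TEXT)` is exactly the statement MySQL rejects in the error above.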
[jira] [Assigned] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16404: Assignee: Apache Spark > LeastSquaresAggregator in Linear Regression serializes unnecessary data > --- > > Key: SPARK-16404 > URL: https://issues.apache.org/jira/browse/SPARK-16404 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark > > This is basically the same issue as > [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for > linear regression, where {{coefficients}} and {{featuresStd}} are > unnecessarily serialized between stages.
[jira] [Assigned] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16404: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368652#comment-15368652 ] Apache Spark commented on SPARK-16404: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/14109 > LeastSquaresAggregator in Linear Regression serializes unnecessary data > --- > > Key: SPARK-16404 > URL: https://issues.apache.org/jira/browse/SPARK-16404 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson > > This is basically the same issue as > [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for > linear regression, where {{coefficients}} and {{featuresStd}} are > unnecessarily serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
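The underlying problem in SPARK-16404 (and SPARK-16008) is that a task closure which references a field of the aggregator captures the whole object, so large arrays like {{coefficients}} and {{featuresStd}} travel with every serialized task. The usual remedy is to copy only what the closure needs into a local variable first. A rough illustration of the capture pattern in Python — a toy stand-in, not Spark's LeastSquaresAggregator code:

```python
class Aggregator:
    """Toy stand-in for LeastSquaresAggregator (illustrative names only)."""

    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients   # large per-model arrays that
        self.features_std = features_std   # should stay on the driver

    def bad_closure(self):
        # Referencing `self` puts the whole aggregator in the closure, so
        # both arrays would be serialized with every task that uses it.
        return lambda x: x * len(self.coefficients)

    def good_closure(self):
        # Copy just the needed value into a local first; the closure then
        # captures only that, and the big arrays are left behind.
        n = len(self.coefficients)
        return lambda x: x * n

agg = Aggregator(list(range(100000)), list(range(100000)))
bad, good = agg.bad_closure(), agg.good_closure()

# The bad closure drags the whole Aggregator along; the good one holds an int.
print([type(c.cell_contents).__name__ for c in bad.__closure__])   # ['Aggregator']
print([type(c.cell_contents).__name__ for c in good.__closure__])  # ['int']
```

Both closures compute the same result; only the captured environment differs, which is exactly the payload that gets serialized between stages.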
[jira] [Resolved] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10
[ https://issues.apache.org/jira/browse/SPARK-16453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16453. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14108 [https://github.com/apache/spark/pull/14108] > release script does not publish spark-hive-thriftserver_2.10 > > > Key: SPARK-16453 > URL: https://issues.apache.org/jira/browse/SPARK-16453 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368602#comment-15368602 ] Manoj Kumar commented on SPARK-16365: - Could you be a bit clearer about the first point? Is it so that people can quickly prototype locally with a small subsample of the data before doing the dataframe | RDD conversion to handle huge amounts of data? > Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstorming and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368594#comment-15368594 ] Shivaram Venkataraman commented on SPARK-15767: --- Sorry I missed this thread. I agree with [~mengxr] that we should go with the `spark.algo` scheme and use the MLlib param names. In the future if we feel like we have significant overlap we can add a `rpart` wrapper that can mimic the existing package. In terms of naming my vote would be to go with something like `spark.decisiontree` or `spark.randomforest` -- it's slightly better to be explicit, is my take. > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's naive > Decision Tree Regression implementation is from package rpart with signature > rpart(formula, dataframe, method="anova"). I propose we could implement an API > like spark.rpart(dataframe, formula, ...) . After having implemented > decision tree classification, we could refactor these two into an API more > like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368552#comment-15368552 ] Dongjoon Hyun commented on SPARK-16452: --- I see. Thanks. > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368550#comment-15368550 ] Reynold Xin commented on SPARK-16452: - org.apache.spark.sql.execution.systemcatalog ? Something like that. > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368548#comment-15368548 ] Xin Ren commented on SPARK-16437: - It's SQL's problem I think, I'll remove the SparkR tag > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-16437: Component/s: (was: SparkR) > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368546#comment-15368546 ] Shivaram Venkataraman commented on SPARK-16437: --- FWIW this also happens with Scala (i.e. using bin/spark-shell). I dont think there is anything SparkR specific about this issue > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368547#comment-15368547 ] Xiangrui Meng commented on SPARK-15767: --- This was discussed in SPARK-14831. We should call it `spark.algo(data, formula, method, required params, [optional params])` and use the same param names as in MLlib. But I'm not sure what method name to use here. We should think about method names for all tree methods together. cc [~josephkb] > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's naive > Decision Tree Regression implementation is from package rpart with signature > rpart(formula, dataframe, method="anova"). I propose we could implement an API > like spark.rpart(dataframe, formula, ...) . After having implemented > decision tree classification, we could refactor these two into an API more > like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368544#comment-15368544 ] Dongjoon Hyun commented on SPARK-16452: --- What package name do you prefer? > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16452: -- Comment: was deleted (was: Is this the same or overlapped issue with https://issues.apache.org/jira/browse/SPARK-16201 ?) > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16281) Implement parse_url SQL function
[ https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16281. - Resolution: Fixed Assignee: Jian Wu Fix Version/s: 2.0.1 > Implement parse_url SQL function > > > Key: SPARK-16281 > URL: https://issues.apache.org/jira/browse/SPARK-16281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Jian Wu > Fix For: 2.0.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
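For reference, parse_url(url, partToExtract[, key]) follows Hive's semantics: it extracts components such as HOST, PATH, QUERY, REF, and PROTOCOL, and with a key argument it pulls out a single query parameter. A rough approximation of those semantics using Python's standard library — a sketch only; Spark's actual implementation differs in edge cases:

```python
# Rough sketch of parse_url(url, partToExtract[, key]) semantics using the
# standard library (an approximation, not Spark's implementation).
from urllib.parse import parse_qs, urlparse

def parse_url(url, part, key=None):
    p = urlparse(url)
    if part == "PROTOCOL":
        return p.scheme or None
    if part == "HOST":
        return p.hostname
    if part == "PATH":
        return p.path or None
    if part == "REF":
        return p.fragment or None
    if part == "QUERY":
        if key is None:
            return p.query or None
        # With a key, return that query parameter's (first) value.
        values = parse_qs(p.query).get(key)
        return values[0] if values else None
    return None

print(parse_url("http://example.com/a/b?k=1#top", "HOST"))        # example.com
print(parse_url("http://example.com/a/b?k=1#top", "QUERY", "k"))  # 1
```

Unknown parts and missing components return None, mirroring the NULL results the SQL function produces in those cases.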
[jira] [Resolved] (SPARK-16429) Include `StringType` columns in `describe()`
[ https://issues.apache.org/jira/browse/SPARK-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16429. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.1.0 > Include `StringType` columns in `describe()` > > > Key: SPARK-16429 > URL: https://issues.apache.org/jira/browse/SPARK-16429 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.1.0 > > > Currently, Spark `describe` supports `StringType`. However, `describe()` > returns a dataset for only all numeric columns. > This issue aims to include `StringType` columns in `describe()`, `describe` > without argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10
[ https://issues.apache.org/jira/browse/SPARK-16453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368523#comment-15368523 ] Apache Spark commented on SPARK-16453: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/14108 > release script does not publish spark-hive-thriftserver_2.10 > > > Key: SPARK-16453 > URL: https://issues.apache.org/jira/browse/SPARK-16453 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10
Yin Huai created SPARK-16453: Summary: release script does not publish spark-hive-thriftserver_2.10 Key: SPARK-16453 URL: https://issues.apache.org/jira/browse/SPARK-16453 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-16437: Description: build SparkR with command {code} build/mvn -DskipTests -Psparkr package {code} start SparkR console {code} ./bin/sparkR {code} then get error {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ SparkSession available as 'spark'. > > > library(SparkR) > > df <- read.df("examples/src/main/resources/users.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. > > > head(df) 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl name favorite_color favorite_numbers 1 Alyssa3, 9, 15, 20 2Benred NULL {code} Reference * seems need to add a lib from slf4j to point to older version http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder was: build SparkR with command {code} build/mvn -DskipTests -Psparkr package {code} start SparkR console {code} ./bin/sparkR {code} then get error {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ SparkSession available as 'spark'. > > > library(SparkR) > > df <- read.df("examples/src/main/resources/users.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
> > > head(df) 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl name favorite_color favorite_numbers 1 Alyssa3, 9, 15, 20 2Benred NULL {code} seems need to add a lib from slf4j http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16201) Expose information schema
[ https://issues.apache.org/jira/browse/SPARK-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-16201. --- Resolution: Duplicate > Expose information schema > - > > Key: SPARK-16201 > URL: https://issues.apache.org/jira/browse/SPARK-16201 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell > > As a Spark user I want to be able to query metadata, using information_schema > views. This is an umbrella ticket for adding this to Spark 2.1. > This should support: > - Databases > - Tables > - Views > - Columns > - Partitions > - Functions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15630) 2.0 python coverage ml root module
[ https://issues.apache.org/jira/browse/SPARK-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15630: -- Target Version/s: 2.0.0 Priority: Blocker (was: Major) > 2.0 python coverage ml root module > -- > > Key: SPARK-15630 > URL: https://issues.apache.org/jira/browse/SPARK-15630 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Blocker > > Audit the root pipeline components in PySpark ML for API compatibility. See > parent SPARK-14813 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15623: -- Target Version/s: 2.0.0 Priority: Blocker (was: Major) > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Blocker > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14813: -- Target Version/s: 2.0.0 Priority: Blocker (was: Major) > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > ** *NOTE: These missing features should be added in the next release. This > work is just to generate a list of to-do items for the future.* > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16437: -- Fix Version/s: (was: 2.0.0) > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > seems need to add a lib from slf4j > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368504#comment-15368504 ] Dongjoon Hyun commented on SPARK-16452: --- Is this the same or overlapped issue with https://issues.apache.org/jira/browse/SPARK-16201 ? > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-16437: Description: build SparkR with command {code} build/mvn -DskipTests -Psparkr package {code} start SparkR console {code} ./bin/sparkR {code} then get error {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ SparkSession available as 'spark'. > > > library(SparkR) > > df <- read.df("examples/src/main/resources/users.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. > > > head(df) 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl name favorite_color favorite_numbers 1 Alyssa3, 9, 15, 20 2Benred NULL {code} seems need to add a lib from slf4j http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder was: start SparkR console {code} ./bin/sparkR {code} then get error {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ SparkSession available as 'spark'. > > > library(SparkR) > > df <- read.df("examples/src/main/resources/users.parquet") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
> > > head(df) 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl name favorite_color favorite_numbers 1 Alyssa3, 9, 15, 20 2Benred NULL {code} seems need to add a lib from slf4j http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Reporter: Xin Ren >Priority: Minor > Fix For: 2.0.0
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368497#comment-15368497 ] Reynold Xin commented on SPARK-16452: - Thanks - feel free to break this down into multiple pieces.
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368494#comment-15368494 ] Dongjoon Hyun commented on SPARK-16452: --- Sure!
[jira] [Comment Edited] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368486#comment-15368486 ] Reynold Xin edited comment on SPARK-16452 at 7/8/16 9:12 PM: - cc [~dongjoon] would you be interested in working on this? was (Author: rxin): design spec
[jira] [Created] (SPARK-16452) basic INFORMATION_SCHEMA support
Reynold Xin created SPARK-16452: --- Summary: basic INFORMATION_SCHEMA support Key: SPARK-16452 URL: https://issues.apache.org/jira/browse/SPARK-16452 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Attachments: INFORMATION_SCHEMAsupport.pdf INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a few tables as defined in the SQL92 standard to Spark SQL.
[jira] [Updated] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16452: Attachment: INFORMATION_SCHEMAsupport.pdf design spec
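[Editor's note] For readers unfamiliar with the proposal, a sketch of what the attached design spec is about: once INFORMATION_SCHEMA tables exist, catalog metadata would be reachable through plain SQL instead of Catalog API calls. The schema and column names below are illustrative assumptions taken from the SQL92 layout, not from the attached spec, and the feature is not implemented yet.

```scala
// Hypothetical sketch of querying SQL92-style metadata tables in Spark SQL,
// assuming an active SparkSession named `spark`. Table and column names
// follow the SQL92 INFORMATION_SCHEMA layout; Spark does not support this yet.
val tables = spark.sql(
  """SELECT table_name, table_type
    |FROM information_schema.tables
    |WHERE table_schema = 'default'""".stripMargin)
tables.show()
```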
[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368482#comment-15368482 ] Apache Spark commented on SPARK-16387: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14107 > Reserved SQL words are not escaped by JDBC writer > - > > Key: SPARK-16387 > URL: https://issues.apache.org/jira/browse/SPARK-16387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Lev > > Here is a code (imports are omitted) > object Main extends App { > val sqlSession = SparkSession.builder().config(new SparkConf(). > setAppName("Sql Test").set("spark.app.id", "SQLTest"). > set("spark.master", "local[2]"). > set("spark.ui.enabled", "false") > .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" )) > ).getOrCreate() > import sqlSession.implicits._ > val localprops = new Properties > localprops.put("user", "") > localprops.put("password", "") > val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order") > val writer = df.write > .mode(SaveMode.Append) > writer > .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops) > } > End error is : > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'order TEXT )' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > Clearly the reserved word has to be quoted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16387: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368479#comment-15368479 ] Xin Ren commented on SPARK-16445: - Hi Xiangrui, may I have a try on this one? Is there a strict deadline to hit? Thanks a lot > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Assigned] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16387: Assignee: Apache Spark
[jira] [Closed] (SPARK-10292) make metadata query-able by adding meta table
[ https://issues.apache.org/jira/browse/SPARK-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-10292. --- Resolution: Later > make metadata query-able by adding meta table > - > > Key: SPARK-10292 > URL: https://issues.apache.org/jira/browse/SPARK-10292 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > > To begin with, we can set up 2 meta tables: catalog_tables and > catalog_columns. > catalog_tables has 3 columns: name, database, is_temporary. > catalog_columns has 5 columns: name, table, database, data_type, nullable.
[jira] [Created] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
Yesha Vora created SPARK-16451: -- Summary: Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit Key: SPARK-16451 URL: https://issues.apache.org/jira/browse/SPARK-16451 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Yesha Vora Steps to reproduce (secure cluster): * kdestroy * spark-shell --master yarn-client If no valid keytab is set while running spark-shell/pyspark, the Spark client never exits; it keeps printing the error messages below. The client should invoke its shutdown hook immediately and exit with a proper error code. Currently, the user has to terminate the process explicitly (using Ctrl+C). {code} 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595) at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756) at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617) at org.apache.hadoop.ipc.Client.call(Client.java:1448) at org.apache.hadoop.ipc.Client.call(Client.java:1395) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at
com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151) at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408) at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437) at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) at org.apache.spark.SparkContext.(SparkContext.scala:530) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.(:15) at $line3.$read$$iwC.(:24) at $line3.$read.(:26) at $line3.$read$.(:30) at $line3.$read$.() at $line3.$eval$.(:7) at $line3.$eval$.() at $line3.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodA
[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368458#comment-15368458 ] Dongjoon Hyun commented on SPARK-16387: --- You're right. I'll make a PR for this.
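[Editor's note] The fix discussed in this thread can lean on Spark's per-database dialect abstraction: each dialect knows its own quoting character, so the JDBC writer only needs to wrap column names when generating the CREATE TABLE statement. A minimal sketch follows; it is simplified from the real `JdbcDialect` API, and the helper names are assumptions about what a patch might look like, not the merged implementation.

```scala
// Sketch of dialect-aware identifier quoting, assuming a JdbcDialect-style
// abstraction. MySQL uses backticks; the SQL standard (and PostgreSQL) uses
// double quotes; MS SQL Server uses square brackets.
trait Dialect { def quoteIdentifier(name: String): String }

object MySQLDialect extends Dialect {
  def quoteIdentifier(name: String): String = s"`$name`"
}
object PostgresDialect extends Dialect {
  def quoteIdentifier(name: String): String = "\"" + name + "\""
}

// The writer would then build the DDL from quoted names, so a reserved
// word such as `order` no longer breaks CREATE TABLE.
def createTableDdl(table: String, cols: Seq[(String, String)], d: Dialect): String =
  cols.map { case (n, t) => s"${d.quoteIdentifier(n)} $t" }
      .mkString(s"CREATE TABLE $table (", ", ", ")")

// createTableDdl("jira_test", Seq("order" -> "TEXT"), MySQLDialect)
//   => CREATE TABLE jira_test (`order` TEXT)
```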
[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16387: -- Comment: was deleted (was: Oh, it means Pull Request. Since you know `JdbcDialect` class, I think you can make a code patch for that.)
[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16387: -- Comment: was deleted (was: Hi, `escaping` sounds possible, but it is not an easy issue to implement that portably for all DB. We need to support various MySQL, PostgreSQL, MSSQL, and so on. The standard is double quote ("), but even MySQL does not support that naturally. (Only supported in a ANSI mode?). MySQL uses backtick (`), but PostgreSQL does not (if I remember correctly.) MSSQL uses '[]'. I want to help you with this issue, but I've no idea. Do you have any idea for this? )
[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16387: -- Comment: was deleted (was: Then, could you make a PR for this?)
[jira] [Updated] (SPARK-13346) Using DataFrames iteratively leads to slow query planning
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13346: -- Summary: Using DataFrames iteratively leads to slow query planning (was: Using DataFrames iteratively leads to massive query plans, which slows execution) > Using DataFrames iteratively leads to slow query planning > - > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration.
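[Editor's note] The RDD round-trip workaround mentioned in the description (converting to an RDD and back to truncate the accumulated logical plan) can be sketched as follows. `spark` is assumed to be an active SparkSession, and the loop body is a stand-in for whatever per-iteration transformation the algorithm applies.

```scala
// Sketch of the RDD round-trip workaround for growing query plans,
// assuming an active SparkSession `spark` and some iterative update step.
var df = spark.range(0, 1000).toDF("value")
for (i <- 1 to 10) {
  df = df.selectExpr("value + 1 AS value")   // per-iteration transformation
  // Recreating the DataFrame from its RDD discards the accumulated
  // logical plan, keeping analysis/planning time roughly constant.
  df = spark.createDataFrame(df.rdd, df.schema)
  df.cache()
}
```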
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368320#comment-15368320 ] holdenk commented on SPARK-15581: - Yah - the more I look at it the more rough it seems - [~tdas] has it in his slides as targeted for Spark 2.1+ (http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming) > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. 
If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. 
We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal o
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368318#comment-15368318 ] Nick Pentreath commented on SPARK-15581: I think it would be pretty interesting to explore a (probably fairly experimental) mechanism to train on structured streams/DFs and sink to a "prediction" stream and/or some "model state store"
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368310#comment-15368310 ] Nick Pentreath commented on SPARK-16365: Good question - and part of the reason for getting discussion going here. In general (IMO) the short answer is "no" - I think Spark should be the tool for training models on moderately large to extremely large datasets, but not necessarily for completely general machine learning. I think the idea behind {{mllib-local}} is potentially two-fold: (i) make it easier to use Spark models / pipelines in production scenarios, and (ii) enhance linalg primitives available to devs / users. > Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstormings and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16420) UnsafeShuffleWriter leaks compression streams with off-heap memory.
[ https://issues.apache.org/jira/browse/SPARK-16420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16420. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.0.0 > UnsafeShuffleWriter leaks compression streams with off-heap memory. > --- > > Key: SPARK-16420 > URL: https://issues.apache.org/jira/browse/SPARK-16420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.0 > > > When merging spill files using Java file streams, {{UnsafeShuffleWriter}} > opens a decompression stream for each shuffle part and [never closes > them|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L352-357]. > When the compression codec holds off-heap resources, these aren't cleaned up > until the finalizer is called.
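The leak described above is a general resource-management pattern, not something Spark-specific: a stream opened per part must be closed deterministically rather than left for a finalizer, which matters most when the stream holds off-heap resources. A minimal Python sketch of that pattern (this is an illustration only, not the actual Java code in `UnsafeShuffleWriter`; `io.BytesIO` stands in for a per-part decompression stream):

```python
import io

def merge_spill_parts(parts):
    """Merge per-part streams, closing each one deterministically.

    The bug pattern this avoids: opening a stream per part and never
    closing it, so cleanup only happens when the finalizer eventually
    runs (too late if the stream holds off-heap resources).
    """
    merged = bytearray()
    for raw in parts:
        stream = io.BytesIO(raw)  # stand-in for a decompression stream
        try:
            merged.extend(stream.read())
        finally:
            stream.close()  # always runs, even if read() raises
    return bytes(merged)

print(merge_spill_parts([b"part1-", b"part2"]))  # → b'part1-part2'
```

In Java the equivalent fix is a try-with-resources (or try/finally) around each per-part stream so closure does not depend on garbage collection.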
[jira] [Issue Comment Deleted] (SPARK-16450) Build failes for Mesos 0.28.x
[ https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Becker updated SPARK-16450: - Comment: was deleted (was: Spark 1.6.1 was working with Mesos 0.28.0) > Build failes for Mesos 0.28.x > - > > Key: SPARK-16450 > URL: https://issues.apache.org/jira/browse/SPARK-16450 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0 > Environment: Mesos 0.28.0 >Reporter: Niels Becker > > Build fails: > [error] > /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82: > type mismatch; > [error] found : org.apache.mesos.protobuf.ByteString > [error] required: String > [error] credBuilder.setSecret(ByteString.copyFromUtf8(secret)) > Build cmd: > dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive > -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8 > Spark Version: 2.0.0-rc2 > Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14 > Scala Version: 2.11.8 > Same error for mesos.version=0.28.2
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368249#comment-15368249 ] Dongjoon Hyun commented on SPARK-16449: --- SPARK-16173 resolved that kind of issue. > unionAll raises "Task not serializable" > --- > > Key: SPARK-16449 > URL: https://issues.apache.org/jira/browse/SPARK-16449 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: AWS EMR, Jupyter notebook >Reporter: Jeff Levy >Priority: Minor > > Goal: Take the output from `describe` on a large DataFrame, then use a loop > to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each > column, build them into a DataFrame of two rows, then use `unionAll` to merge > them together. > Issue: Despite having the same column names, in the same order with the same > dtypes, the `unionAll` fails with "Task not serializable". However, if I > build two test rows using dummy data then `unionAll` works fine. Also, if I > collect my results then turn them straight back into DataFrames, `unionAll` > succeeds. > Step-by-step code and output with comments can be seen here: > https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb > The issue appears to be in the way the loop in code block 6 is building the > rows before parallelizing, but the results look no different from the test > rows that do work. I reproduced this on multiple datasets, so downloading > the notebook and pointing it to any data of your own should replicate it.
[jira] [Issue Comment Deleted] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16449: -- Comment: was deleted (was: Oh, I see.)
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368245#comment-15368245 ] Dongjoon Hyun commented on SPARK-16449: --- It seems you are using the result of `describe` directly in the `unionAll`. I mean `df_described`.
```
df_described = df.describe()
expanded_describe = df_described.unionAll(df_described2)
expanded_describe.show()
```
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368232#comment-15368232 ] Dongjoon Hyun commented on SPARK-16449: --- Oh, I see.
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368231#comment-15368231 ] Dongjoon Hyun commented on SPARK-16449: --- Unfortunately, I cannot test the notebook because the given S3 bucket is not public. You can try that in Databricks Community Edition, which offers 2.0.0 RC2 including SPARK-16173.
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368226#comment-15368226 ] Jeff Levy commented on SPARK-16449: --- The problem doesn't appear to be with `describe` here. If I leave the describe DataFrame unchanged but turn the data for merging into a list via collect and then back into a DataFrame, `unionAll` works.
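The collect-then-rebuild workaround above amounts to materializing lazily evaluated data before it is captured in a serialized task closure. The same failure mode can be reproduced in plain Python, where pickling a live generator fails but pickling its materialized contents succeeds (a sketch of the concept only, not PySpark's actual serialization path):

```python
import pickle

# Lazy, stateful object — analogous to the non-serializable Scala Iterator
# captured in the closure.
rows = ((name, value) for name, value in [("skewness", 0.34), ("kurtosis", 4.01)])

# Serializing the live, lazily-evaluated object fails...
try:
    pickle.dumps(rows)
    lazy_serializable = True
except TypeError:
    lazy_serializable = False  # generators carry execution state and can't be pickled

# ...but materializing first (the "collect" step) makes it serializable.
materialized = list(rows)
payload = pickle.dumps(materialized)

print(lazy_serializable, pickle.loads(payload))
```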
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368216#comment-15368216 ] Manoj Kumar commented on SPARK-3728: Hi [~xusen]. Are you still working on this? > RandomForest: Learn models too large to store in memory > --- > > Key: SPARK-3728 > URL: https://issues.apache.org/jira/browse/SPARK-3728 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Proposal: Write trees to disk as they are learned. > RandomForest currently uses a FIFO queue, which means training all trees at > once via breadth-first search. Using a FILO queue would encourage the code > to finish one tree before moving on to new ones. This would allow the code > to write trees to disk as they are learned. > Note: It would also be possible to write nodes to disk as they are learned > using a FIFO queue, once the example--node mapping is cached [JIRA]. The > [Sequoia Forest package]() does this. However, it could be useful to learn > trees progressively, so that future functionality such as early stopping > (training fewer trees than expected) could be supported.
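The FIFO-vs-FILO point in the proposal above can be made concrete with a toy scheduler: with a LIFO work queue (a stack), all nodes of one tree are expanded before the next tree is touched, so each tree can be flushed to disk the moment it completes; with a FIFO queue, trees are trained breadth-first in lockstep and none completes until the very end. This is an illustrative model only, not MLlib's actual node-queue code:

```python
from collections import deque

def completion_order(num_trees, depth, lifo):
    """Toy model: each work item is (tree, level); a node above the max
    depth spawns two children. Returns tree ids in completion order —
    a completed tree is one that could be written to disk immediately."""
    queue = deque((t, 0) for t in range(num_trees))
    remaining = {t: 2 ** (depth + 1) - 1 for t in range(num_trees)}  # nodes per tree
    finished = []
    while queue:
        tree, level = queue.pop() if lifo else queue.popleft()  # stack vs FIFO
        remaining[tree] -= 1
        if remaining[tree] == 0:
            finished.append(tree)  # whole tree done — flushable to disk now
        if level < depth:
            queue.extend([(tree, level + 1), (tree, level + 1)])
    return finished

# LIFO finishes trees one at a time; FIFO finishes them all at the end.
print(completion_order(3, 2, lifo=True))   # → [2, 1, 0]
print(completion_order(3, 2, lifo=False))  # → [0, 1, 2]
```

The LIFO run completes its first tree after only 7 of 21 node expansions, which is exactly what makes streaming trees to disk (and early stopping) possible.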
[jira] [Comment Edited] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368204#comment-15368204 ] Dongjoon Hyun edited comment on SPARK-16449 at 7/8/16 7:03 PM: --- In that issue, the root cause was `describe` itself. It was merged into both master and the 1.6 branch. was (Author: dongjoon): In that issue, the root cause was `describe` itself.
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368204#comment-15368204 ] Dongjoon Hyun commented on SPARK-16449: --- In that issue, the root cause was `describe` itself.
[jira] [Commented] (SPARK-16450) Build failes for Mesos 0.28.x
[ https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368199#comment-15368199 ] Niels Becker commented on SPARK-16450: -- Spark 1.6.1 was working with Mesos 0.28.0
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368201#comment-15368201 ] Dongjoon Hyun commented on SPARK-16449: --- Could you try this in master? It seems to be similar to SPARK-16173.
[jira] [Resolved] (SPARK-13638) Support for saving with a quote mode
[ https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13638. - Resolution: Fixed Assignee: Jurriaan Pruis Fix Version/s: 2.0.1 > Support for saving with a quote mode > > > Key: SPARK-13638 > URL: https://issues.apache.org/jira/browse/SPARK-13638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Jurriaan Pruis >Priority: Minor > Fix For: 2.0.1 > > > https://github.com/databricks/spark-csv/pull/254 > tobithiel reported this: > {quote} > I'm dealing with some messy csv files and being able to just quote all fields > is very useful, so that other applications don't misunderstand the file because of some > sketchy characters > {quote} > When writing, there are several quote modes in Apache Commons CSV (see > https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html), > which might have to be supported. > However, it looks like the univocity parser used for writing (currently the only supported library) > does not support these quote modes. I think we can > drop this backwards compatibility if we are not going to add Apache Commons CSV. > This is a reminder that it might break backwards compatibility for the > option {{quoteMode}}.
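The Commons CSV quote modes linked above correspond to quoting styles found in most CSV writers. Python's standard `csv` module has the same notion, so it can illustrate the "quote every field" behavior the reporter asked for (an illustration of the concept only, not Spark's CSV writer API):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_ALL: every field is quoted, so "sketchy characters" such as a
# delimiter appearing inside a field can never be misread by a consumer.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["id", "7", "messy, value"])
print(buf.getvalue().strip())  # → "id","7","messy, value"
```

The other common modes (quote only when needed, quote non-numeric fields, never quote) map onto `csv.QUOTE_MINIMAL`, `csv.QUOTE_NONNUMERIC`, and `csv.QUOTE_NONE` in the same way.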
[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368149#comment-15368149 ] Xusen Yin commented on SPARK-16447: --- [~mengxr] I'd like to work on this. > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368143#comment-15368143 ] Sean Owen commented on SPARK-16449: --- Interesting. Looks like something is serializing a Scala Iterator and it doesn't work. The relevant subset is below. Hm, maybe something in LocalTableScan can be rejiggered to avoid this. {code} Caused by: java.io.NotSerializableException: scala.collection.Iterator$$anon$11 Serialization stack: - object not serializable (class: scala.collection.Iterator$$anon$11, value: empty iterator) - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: $outer, type: interface scala.collection.Iterator) - object (class scala.collection.Iterator$$anonfun$toStream$1, ) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(WrappedArray(3526154, 3526154, 1580402, 3526154, 3526154), WrappedArray(5.50388599500189E11, 4.178168090221903, 234846.780654818, 5.134865351881966, 354.7084951479714), WrappedArray(2.596112361975223E11, 0.34382335723646484, 118170.68592261613, 3.3833930336063456, 4.011812510792076), WrappedArray(12091588, 2.75, 0.85, -1, 292), WrappedArray(95696635, 6.125, 1193544.39, 34, 480))) - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$zip$1, ) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream((WrappedArray(3526154, 3526154, 1580402, 3526154, 3526154),(count,)), (WrappedArray(5.50388599500189E11, 4.178168090221903, 234846.780654818, 5.134865351881966, 354.7084951479714),(mean,)), (WrappedArray(2.596112361975223E11, 0.34382335723646484, 118170.68592261613, 3.3833930336063456, 
4.011812510792076),(stddev,)), (WrappedArray(12091588, 2.75, 0.85, -1, 292),(min,)), (WrappedArray(95696635, 6.125, 1193544.39, 34, 480),(max, - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, ) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream([count,3526154,3526154,1580402,3526154,3526154], [mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714], [stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076], [min,12091588,2.75,0.85,-1,292], [max,95696635,6.125,1193544.39,34,480])) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, ) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream([count,3526154,3526154,1580402,3526154,3526154], [mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714], [stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076], [min,12091588,2.75,0.85,-1,292], [max,95696635,6.125,1193544.39,34,480])) - field (class: org.apache.spark.sql.execution.LocalTableScan, name: rows, type: interface scala.collection.Seq) - object (class org.apache.spark.sql.execution.LocalTableScan, LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], 
[[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],[min,12091588,2.75,0.85,-1,292],[max,95696635,6.125,1193544.39,34,480]] ) - field (class: org.apache.spark.sql.execution.ConvertToUnsafe, name: child, type: class org.apache.spark.sql.execution.SparkPlan) - object (class org.apache.spark.sql.execution.ConvertToUnsafe, ConvertToUnsafe +- LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], [[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],[min,12091588,2.75,0.85,-1,292],[max,95696635,6.125,1193544.39,34,480]] ) - field (class: org
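The root cause visible in the trace above is that the rows of the `LocalTableScan` (produced by `describe()`) are backed by a lazy Scala `Stream`/`Iterator`, and lazy collections cannot be serialized; that is also why the reporter's workaround of collecting the rows and rebuilding the DataFrame succeeds. A minimal analogy in plain Python (illustrative only, not Spark code):

```python
import pickle

# A generator is lazy, like the Scala Iterator/Stream captured in the
# serialization stack above: it cannot be serialized.
rows_lazy = ((i, i * i) for i in range(5))

# Materializing it into a list (analogous to collect()-ing the rows and
# rebuilding the DataFrame) yields a plain, serializable structure.
rows_materialized = [(i, i * i) for i in range(5)]

try:
    pickle.dumps(rows_lazy)
    lazy_serializable = True
except TypeError:
    lazy_serializable = False

print(lazy_serializable)  # False: lazy collections don't serialize
print(pickle.loads(pickle.dumps(rows_materialized)) == rows_materialized)  # True
```

The Spark-side fix would presumably be the same idea: make sure `LocalTableScan` holds a strict, materialized `Seq` of rows rather than a lazily-built `Stream` before the plan is shipped to executors.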
[jira] [Commented] (SPARK-16450) Build fails for Mesos 0.28.x
[ https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368138#comment-15368138 ] Sean Owen commented on SPARK-16450: --- I don't think this version of Mesos is supported; it sounds like the API changed in a mutually incompatible way. > Build fails for Mesos 0.28.x > - > > Key: SPARK-16450 > URL: https://issues.apache.org/jira/browse/SPARK-16450 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0 > Environment: Mesos 0.28.0 >Reporter: Niels Becker > > Build fails: > [error] > /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82: > type mismatch; > [error] found : org.apache.mesos.protobuf.ByteString > [error] required: String > [error] credBuilder.setSecret(ByteString.copyFromUtf8(secret)) > Build cmd: > dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive > -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8 > Spark Version: 2.0.0-rc2 > Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14) > Scala Version: 2.11.8 > Same error for mesos.version=0.28.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16450) Build fails for Mesos 0.28.x
Niels Becker created SPARK-16450: Summary: Build fails for Mesos 0.28.x Key: SPARK-16450 URL: https://issues.apache.org/jira/browse/SPARK-16450 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.0.0 Environment: Mesos 0.28.0 Reporter: Niels Becker Build fails: [error] /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82: type mismatch; [error] found : org.apache.mesos.protobuf.ByteString [error] required: String [error] credBuilder.setSecret(ByteString.copyFromUtf8(secret)) Build cmd: dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8 Spark Version: 2.0.0-rc2 Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14) Scala Version: 2.11.8 Same error for mesos.version=0.28.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Levy updated SPARK-16449: -- Description: Goal: Take the output from `describe` on a large DataFrame, then use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build them into a DataFrame of two rows, then use `unionAll` to merge them together. Issue: Despite having the same column names, in the same order with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows using dummy data then `unionAll` works fine. Also, if I collect my results then turn them straight back into DataFrames, `unionAll` succeeds. Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb The issue appears to be in the way the loop in code block 6 is building the rows before parallelizing, but the results look no different from the test rows that do work. I reproduced this on multiple datasets, so downloading the notebook and pointing it to any data of your own should replicate it. was: Goal: Take the output from `describe` on a large DataFrame, then use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build them into a DataFrame of two rows, then use `unionAll` to merge them together. Issue: Despite having the same column names, in the same order with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows using dummy data then `unionAll` works fine. Also, if I collect my results then turn them straight back into DataFrames, `unionAll` succeeds. 
Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb The issue appears to be in the way the loop in code block 6 is building the rows before parallelizing, but the results look no different from the test rows that do work. > unionAll raises "Task not serializable" > --- > > Key: SPARK-16449 > URL: https://issues.apache.org/jira/browse/SPARK-16449 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: AWS EMR, Jupyter notebook >Reporter: Jeff Levy >Priority: Minor > > Goal: Take the output from `describe` on a large DataFrame, then use a loop > to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each > column, build them into a DataFrame of two rows, then use `unionAll` to merge > them together. > Issue: Despite having the same column names, in the same order with the same > dtypes, the `unionAll` fails with "Task not serializable". However, if I > build two test rows using dummy data then `unionAll` works fine. Also, if I > collect my results then turn them straight back into DataFrames, `unionAll` > succeeds. > Step-by-step code and output with comments can be seen here: > https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb > The issue appears to be in the way the loop in code block 6 is building the > rows before parallelizing, but the results look no different from the test > rows that do work. I reproduced this on multiple datasets, so downloading > the notebook and pointing it to any data of your own should replicate it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16409) regexp_extract with optional groups causes NPE
[ https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Moroz updated SPARK-16409: -- Description: df = sqlContext.createDataFrame([['c']], ['s']) df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect() causes NPE. Worse, in a large program it doesn't cause NPE instantly; it actually works fine, until some unpredictable (and inconsistent) moment in the future when (presumably) the invalid memory access occurs, and then it fails. For this reason, it took several hours to debug this. Suggestion: either fill the group with null; or raise exception immediately after examining the argument with a message that optional groups are not allowed. Traceback: --- Py4JJavaError Traceback (most recent call last) in () > 1 df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect() C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py in collect(self) 294 """ 295 with SCCallSiteSync(self._sc) as css: --> 296 port = self._jdf.collectToPython() 297 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 298 C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\java_gateway.py in __call__(self, *args) 931 answer = self.gateway_client.send_command(command) 932 return_value = get_return_value( --> 933 answer, self.gateway_client, self.target_id, self.name) 934 935 for temp_arg in temp_args: C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw) 55 def deco(*a, **kw): 56 try: ---> 57 return f(*a, **kw) 58 except py4j.protocol.Py4JJavaError as e: 59 s = e.java_exception.toString() C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 310 raise Py4JJavaError( 311 "An error occurred while calling {0}{1}{2}.\n". 
--> 312 format(target_id, ".", name), value) 313 else: 314 raise Py4JError( Py4JJavaError: An error occurred while calling o51.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:883) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:883) at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1889) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1889) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
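The NPE stems from how regexes treat optional groups: a group wrapped in `?` may not participate in the match at all, in which case extracting it yields null rather than an empty string, and the null then hits `UnsafeRowWriter.write`. Java and Python regexes agree on this semantics, so both the failure mode and a pattern-level workaround (moving the `?` inside the group so the group always participates) can be sketched with plain Python `re`; whatever fix SPARK-16409 eventually ships may of course differ:

```python
import re

s = "ac"

# (b)? : the whole group is optional, so for "ac" group 2 never
# participates in the match and comes back as None -- the null that
# trips up Spark's row writer.
m = re.match(r"(a+)(b)?(c)", s)
print(m.group(2))   # None

# (b?) : the group itself always participates; when there is no "b"
# it matches the empty string instead of returning null.
m2 = re.match(r"(a+)(b?)(c)", s)
print(repr(m2.group(2)))  # ''
```

Rewriting the pattern this way is a user-side workaround; the suggestion in the report (return null/empty for the group, or fail fast with a clear message) would address it inside `regexp_extract` itself.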
[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368056#comment-15368056 ] Dongjoon Hyun edited comment on SPARK-16439 at 7/8/16 5:50 PM: --- For me, it seems to work. https://app.box.com/s/lw0kl1ft7z4od9fwamtoafipzfrnex8m Could you provide more information? was (Author: dongjoon): For me, it works. https://app.box.com/representation/file_version_77719788413/image_2048/1.png Could you provide more information? > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > Attachments: spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16409) regexp_extract with optional groups causes NPE
[ https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368059#comment-15368059 ] Max Moroz commented on SPARK-16409: --- [~srowen] So sorry I was sure I copied the entire code. I'm gonna update the issue with the full details. > regexp_extract with optional groups causes NPE > -- > > Key: SPARK-16409 > URL: https://issues.apache.org/jira/browse/SPARK-16409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Max Moroz > > df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect() > causes NPE. Worse, in a large program it doesn't cause NPE instantly; it > actually works fine, until some unpredictable (and inconsistent) moment in > the future when (presumably) the invalid memory access occurs, and then it > fails. For this reason, it took several hours to debug this. > Suggestion: either fill the group with null; or raise exception immediately > after examining the argument with a message that optional groups are not > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368056#comment-15368056 ] Dongjoon Hyun commented on SPARK-16439: --- For me, it works. https://app.box.com/representation/file_version_77719788413/image_2048/1.png Could you provide more information? > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > Attachments: spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16449) unionAll raises "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Levy updated SPARK-16449: -- Description: Goal: Take the output from `describe` on a large DataFrame, then use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build them into a DataFrame of two rows, then use `unionAll` to merge them together. Issue: Despite having the same column names, in the same order with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows using dummy data then `unionAll` works fine. Also, if I collect my results then turn them straight back into DataFrames, `unionAll` succeeds. Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb The issue appears to be in the way the loop in code block 6 is building the rows before parallelizing, but the results look no different from the test rows that do work. was: Goal: Take the output from `describe` on a large DataFrame, then use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build them into a DataFrame of two rows, then use `unionAll` to merge them together. Issue: Despite having the same column names, in the same order with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows using dummy data then `unionAll` works fine. Also, if I collect my results then turn them straight back into DataFrames, `unionAll` succeeds. 
Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb > unionAll raises "Task not serializable" > --- > > Key: SPARK-16449 > URL: https://issues.apache.org/jira/browse/SPARK-16449 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: AWS EMR, Jupyter notebook >Reporter: Jeff Levy >Priority: Minor > > Goal: Take the output from `describe` on a large DataFrame, then use a loop > to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each > column, build them into a DataFrame of two rows, then use `unionAll` to merge > them together. > Issue: Despite having the same column names, in the same order with the same > dtypes, the `unionAll` fails with "Task not serializable". However, if I > build two test rows using dummy data then `unionAll` works fine. Also, if I > collect my results then turn them straight back into DataFrames, `unionAll` > succeeds. > Step-by-step code and output with comments can be seen here: > https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb > The issue appears to be in the way the loop in code block 6 is building the > rows before parallelizing, but the results look no different from the test > rows that do work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368044#comment-15368044 ] Miao Wang commented on SPARK-16444: --- [~mengxr] Sure! I have done enough documents and examples over the last couple of weeks. Glad to work on something new. Thanks! > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16449) unionAll raises "Task not serializable"
Jeff Levy created SPARK-16449: - Summary: unionAll raises "Task not serializable" Key: SPARK-16449 URL: https://issues.apache.org/jira/browse/SPARK-16449 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: AWS EMR, Jupyter notebook Reporter: Jeff Levy Priority: Minor Goal: Take the output from `describe` on a large DataFrame, then use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build them into a DataFrame of two rows, then use `unionAll` to merge them together. Issue: Despite having the same column names, in the same order with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows using dummy data then `unionAll` works fine. Also, if I collect my results then turn them straight back into DataFrames, `unionAll` succeeds. Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367997#comment-15367997 ] Kai Jiang commented on SPARK-15767: --- [~mengxr] Would you mind giving some comments on how to design this API? > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with signature > rpart(formula, dataframe, method="anova"). I propose we could implement an API > like spark.rpart(dataframe, formula, ...). After having implemented > decision tree classification, we could refactor these two into an API more > like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15779) SQL context fails when Hive uses Tez as its default execution engine
[ https://issues.apache.org/jira/browse/SPARK-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367977#comment-15367977 ] Jonathan Kelly commented on SPARK-15779: Is there a well-defined list of properties we should include in Spark's copy of hive-site.xml? > SQL context fails when Hive uses Tez as its default execution engine > > > Key: SPARK-15779 > URL: https://issues.apache.org/jira/browse/SPARK-15779 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, SQL >Affects Versions: 1.6.1 > Environment: Hadoop 2.7.2, Spark 1.6.1, Hive 2.0.1, Tez 0.8.3 >Reporter: Alexandre Linte > > By default, Hive uses MapReduce as its default execution engine. Since Hive > 2.0.0, MapReduce is deprecated. > To avoid this deprecation, I decided to use Tez instead of MapReduce as the > default execution engine. Unfortunately, this choice had an impact on Spark. > Now when I start Spark the SQL context fails with the following error: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_85) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. 
> java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:15) > at $iwC.(:24) > at (:26) > at .(:30) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) > at > org.apa
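One commonly suggested workaround for this class of failure (an assumption about the deployment, not something verified in this thread) follows directly from Jonathan Kelly's question about Spark's copy of hive-site.xml: give Spark its own trimmed hive-site.xml in $SPARK_HOME/conf with the execution engine forced back to MapReduce, so the Hive client embedded in Spark never tries to load Tez classes at session startup:

```xml
<!-- $SPARK_HOME/conf/hive-site.xml : Spark's copy, not the cluster-wide one -->
<configuration>
  <property>
    <!-- Keep the embedded Hive client on MapReduce so SessionState.start()
         does not attempt to load org.apache.tez.* classes. -->
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>
</configuration>
```

Hive itself can keep Tez as its default engine; only the copy of the file that Spark reads needs this override.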
[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet
[ https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367974#comment-15367974 ] Wenchen Fan commented on SPARK-15804: - this will be fixed by https://github.com/apache/spark/pull/14106 > Manually added metadata not saving with parquet > --- > > Key: SPARK-15804 > URL: https://issues.apache.org/jira/browse/SPARK-15804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Charlie Evans >Assignee: kevin yu > > Adding metadata with col().as(_, metadata) and then saving the resulting > dataframe does not save the metadata. No error is thrown; the schema > contains the metadata before saving but no longer contains it after saving > and reloading the dataframe. This was working fine with 1.6.1. > {code} > case class TestRow(a: String, b: Int) > val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil > val df = spark.createDataFrame(rows) > import org.apache.spark.sql.types.MetadataBuilder > val md = new MetadataBuilder().putString("key", "value").build() > val dfWithMeta = df.select(col("a"), col("b").as("b", md)) > println(dfWithMeta.schema.json) > dfWithMeta.write.parquet("dfWithMeta") > val dfWithMeta2 = spark.read.parquet("dfWithMeta") > println(dfWithMeta2.schema.json) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata
[ https://issues.apache.org/jira/browse/SPARK-16448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367972#comment-15367972 ] Apache Spark commented on SPARK-16448: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14106 > RemoveAliasOnlyProject should not remove alias with metadata > > > Key: SPARK-16448 > URL: https://issues.apache.org/jira/browse/SPARK-16448 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org