[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501351#comment-16501351 ] Ismael Juma commented on SPARK-18057: - Yes, it is [~kabhwan]. The major version bump is simply because support for Java 7 has been dropped. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-24418: Description: Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple version bump. *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to initialize the Spark before REPL sees any files. Issue filed in Scala community. https://github.com/scala/bug/issues/10913 was: Scala 2.11.12+ will support JDK9+. However, this is not goin to be a simple version bump. *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to initialize the Spark before REPL sees any files. Issue filed in Scala community. https://github.com/scala/bug/issues/10913 > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913
[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24418: Assignee: DB Tsai (was: Apache Spark) > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913
[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24418: Assignee: Apache Spark (was: DB Tsai) > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913
[jira] [Commented] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501337#comment-16501337 ] Apache Spark commented on SPARK-24418: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/21495 > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501334#comment-16501334 ] Jungtaek Lim commented on SPARK-18057: -- Is the Kafka 2.0.0 client compatible with Kafka 1.x and 0.10.x brokers? I guess end users might hesitate to use the latest version in production, especially when the major version changes. The supported broker version range is the most important thing to consider while upgrading. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Created] (SPARK-24467) VectorAssemblerEstimator
Joseph K. Bradley created SPARK-24467: - Summary: VectorAssemblerEstimator Key: SPARK-24467 URL: https://issues.apache.org/jira/browse/SPARK-24467 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.0 Reporter: Joseph K. Bradley In [SPARK-22346], I believe I made a wrong API decision: I recommended adding `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since I thought the latter option would break most workflows. However, I should have proposed: * Add a Param to VectorAssembler for specifying the sizes of Vectors in the inputCols. This Param can be optional. If not given, then VectorAssembler will behave as it does now. If given, then VectorAssembler can use that info instead of figuring out the Vector sizes via metadata or examining Rows in the data (though it could do consistency checks). * Add a VectorAssemblerEstimator which gets the Vector lengths from data and produces a VectorAssembler with the vector lengths Param specified. This will not break existing workflows. Migrating to VectorAssemblerEstimator will be easier than adding VectorSizeHint since it will not require users to manually input Vector lengths. Note: Even with this Estimator, VectorSizeHint might prove useful for other things in the future which require vector length metadata, so we could consider keeping it rather than deprecating it.
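The estimator-produces-a-configured-transformer pattern proposed above can be sketched in plain Python. This is a hypothetical illustration only: `SizedAssembler` and `SizedAssemblerEstimator` are made-up names standing in for the proposed VectorAssembler/VectorAssemblerEstimator pair, not the actual Spark ML API.

```python
# Hypothetical sketch of the proposed pattern (not the actual Spark API):
# an Estimator scans the data once to learn per-column vector sizes, then
# emits a Transformer with those sizes fixed as a parameter, so transform()
# never needs to inspect metadata or rows again (it only validates).

class SizedAssembler:
    """Transformer analogue: concatenates input columns into one vector.

    If `input_sizes` is given, it is used for validation; otherwise sizes
    are simply whatever each row provides (current VectorAssembler-like
    behavior in this toy model).
    """
    def __init__(self, input_cols, input_sizes=None):
        self.input_cols = input_cols
        self.input_sizes = input_sizes

    def transform(self, rows):
        out = []
        for row in rows:
            vec = []
            for col in self.input_cols:
                values = row[col]
                if self.input_sizes is not None:
                    expected = self.input_sizes[col]
                    if len(values) != expected:
                        raise ValueError(
                            f"column {col}: expected size {expected}, got {len(values)}")
                vec.extend(values)
            out.append(vec)
        return out


class SizedAssemblerEstimator:
    """Estimator analogue: learns per-column vector sizes from the data."""
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        # One pass over (the first row of) the data fixes the sizes.
        sizes = {col: len(rows[0][col]) for col in self.input_cols}
        return SizedAssembler(self.input_cols, input_sizes=sizes)
```

Usage would mirror other Estimators: `model = SizedAssemblerEstimator(["a", "b"]).fit(data)` yields a fully parameterized transformer, which is why migration would not require users to enter vector lengths by hand.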
[jira] [Updated] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-19826: -- Shepherd: Weichen Xu > spark.ml Python API for PIC > --- > > Key: SPARK-19826 > URL: https://issues.apache.org/jira/browse/SPARK-19826 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major >
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Epic Name: Support Barrier Execution Mode > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote}
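The scheduling-model difference described in the quoted SPIP can be illustrated with a toy `threading.Barrier` sketch. This is plain Python, not Spark's scheduler; the function and event names are illustrative only. It shows the key property of barrier execution: no task enters its compute/message-passing phase until every task has been launched and reached the barrier.

```python
import threading

# Toy illustration (not Spark code) of the "barrier" idea from the SPIP:
# unlike independent MapReduce-style tasks, all tasks in the stage start
# together and synchronize at a barrier before the message-passing phase.

def run_barrier_stage(num_tasks):
    barrier = threading.Barrier(num_tasks)
    events = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("setup", task_id))
        # No task proceeds past this point until every task reaches it.
        barrier.wait()
        with lock:
            events.append(("compute", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

In this model, every "setup" event is guaranteed to precede every "compute" event, which is exactly the all-workers-start-together contract MPI-style all-reduce needs; a failure before the barrier can abort the whole stage cleanly.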
[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-19826: - Assignee: Huaxin Gao > spark.ml Python API for PIC > --- > > Key: SPARK-19826 > URL: https://issues.apache.org/jira/browse/SPARK-19826 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major >
[jira] [Resolved] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15784. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21493 [https://github.com/apache/spark/pull/21493] > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > Fix For: 2.4.0 > > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501254#comment-16501254 ] Hossein Falaki edited comment on SPARK-24359 at 6/5/18 3:51 AM: [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR we can wait a bit until we are confident about its compatibility. Many users and distributions distribute SparkR from Apache rather than CRAN (one example is Databricks. We build SparkR from source) and they would be able to give us feedback while the project is in alpha state. was (Author: falaki): [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR we can wait a bit until we are confident about its compatibility. Many users and distributions distribute SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR from source. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. 
> Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. 
Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R use
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501254#comment-16501254 ] Hossein Falaki commented on SPARK-24359: [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR we can wait a bit until we are confident about its compatibility. Many users and distributions distribute SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR from source. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. 
We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. 
> * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). If a constructor gets arguments, they will be named > arguments. For example: > {code} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1) > {code} > When calls need to be chained, as in the example above, the syntax can nicely > translate to a natural pipeline style with help from the very popular[ magrittr >
[jira] [Updated] (SPARK-21302) history server WebUI show HTTP ERROR 500
[ https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Pan updated SPARK-21302: -- Description: When navigating to the history server WebUI and checking incomplete applications, it shows HTTP 500. Error logs: 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; refreshing 17/07/05 20:17:44 WARN ServletHandler: /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/ java.lang.NullPointerException at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.spark_project.jetty.server.Server.handle(Server.java:499) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) at java.lang.Thread.run(Thread.java:785) 17/07/05 20:18:00 WARN ServletHandler: / java.lang.NullPointerException at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.spark_project.jetty.server.Server.handle(Server.java:499) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) at java.lang.Thread.run(Thread.java:785) 17/07/05 20:18:17 WARN ServletHandler: / java.lang.NullPointerException at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) was: When navigate to history server WebUI, and check incomplete applications, show http 500 Error logs: 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; refreshing 17/07/05 20:17:44 WARN ServletHandler: 
/history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/ java.lang.NullPointerException at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.spark_project.jetty.server.Server.handle(Se
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501225#comment-16501225 ] Saisai Shao commented on SPARK-20202: - OK, for the 1st, I've already started working on it locally. It looks like it is not a big change; only some POM changes are needed. I will submit a patch to the Hive community. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions.
[jira] [Commented] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility
[ https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501216#comment-16501216 ] Jungtaek Lim commented on SPARK-24466: -- I'm working on this. Will provide a patch soon. > TextSocketMicroBatchReader no longer works with nc utility > -- > > Key: SPARK-24466 > URL: https://issues.apache.org/jira/browse/SPARK-24466 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > While playing with Spark 2.4.0-SNAPSHOT, I found the nc command exits before > reading the actual data, so the query also exits with an error. > > The reason is that a temporary reader is launched to read the schema, then > closed, and a new reader is opened. While a reliable socket server should be > able to handle this without any issue, nc normally can't handle multiple > connections and simply exits when the temporary reader is closed. > > Given that the socket source is expected to be used in examples from the > official documentation or for quick experiments, where we tend to simply use > netcat, this is better treated as a bug, though it is really a limitation of > netcat.
[jira] [Created] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility
Jungtaek Lim created SPARK-24466: Summary: TextSocketMicroBatchReader no longer works with nc utility Key: SPARK-24466 URL: https://issues.apache.org/jira/browse/SPARK-24466 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jungtaek Lim While playing with Spark 2.4.0-SNAPSHOT, I found the nc command exits before reading the actual data, so the query also exits with an error. The reason is that a temporary reader is launched to read the schema, then closed, and a new reader is opened. While a reliable socket server should be able to handle this without any issue, nc normally can't handle multiple connections and simply exits when the temporary reader is closed. Given that the socket source is expected to be used in examples from the official documentation or for quick experiments, where we tend to simply use netcat, this is better treated as a bug, though it is really a limitation of netcat.
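The failure mode described above can be reproduced outside Spark with a toy single-connection server that behaves like plain `nc -l` (as opposed to `nc -lk`). This is a hedged Python sketch, not Spark or netcat code: the first connection plays the role of the temporary schema reader, and the reconnect plays the role of the real reader that finds nobody listening.

```python
import socket
import threading
import time

# Toy reproduction (not Spark code) of the reported failure: a server that
# serves exactly one connection and then stops listening, like `nc -l`.

def one_shot_server(state):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # OS-assigned free port
    srv.listen(1)
    state["port"] = srv.getsockname()[1]
    state["event"].set()                # tell the client we're ready
    conn, _ = srv.accept()              # serve exactly one connection
    conn.recv(1)                        # returns b"" when the peer closes
    conn.close()
    srv.close()                         # then stop listening entirely

state = {"event": threading.Event()}
threading.Thread(target=one_shot_server, args=(state,), daemon=True).start()
state["event"].wait()
port = state["port"]

# First connection (the "temporary schema reader") succeeds, then closes.
first = socket.create_connection(("127.0.0.1", port))
first.close()
time.sleep(0.2)  # let the server observe the close and shut down

# Second connection (the "real" reader) fails: the one-shot server is gone.
try:
    socket.create_connection(("127.0.0.1", port), timeout=1)
    second_connect_ok = True
except OSError:
    second_connect_ok = False
```

A robust server would keep accepting connections (as `nc -lk` does), which is why the issue is a limitation of plain netcat rather than of sockets in general.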
[jira] [Resolved] (SPARK-24403) reuse r worker
[ https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-24403. -- Resolution: Duplicate > reuse r worker > -- > > Key: SPARK-24403 > URL: https://issues.apache.org/jira/browse/SPARK-24403 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Deepansh >Priority: Major > Labels: sparkR > > Currently, SparkR doesn't support reuse of its workers, so broadcast > variables and closures are transferred to the workers each time. Can we apply > the idea of Python worker reuse to SparkR as well, to enhance its performance? > performance issues reference > [https://issues.apache.org/jira/browse/SPARK-23650]
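For intuition, here is a toy Python model of why worker reuse matters; it is not SparkR (or PySpark) internals, and the class and function names are illustrative. A fresh worker per task pays the broadcast transfer every time, while a long-lived, reused worker receives it once and serves it from a cache afterwards.

```python
# Toy model (not SparkR internals) of broadcast transfer cost with and
# without worker reuse.

class Worker:
    def __init__(self):
        self._broadcast_cache = None
        self.transfers = 0

    def run_task(self, task, broadcast):
        if self._broadcast_cache is None:
            self._broadcast_cache = broadcast  # ship the broadcast once
            self.transfers += 1
        return task(self._broadcast_cache)

def run_without_reuse(tasks, broadcast):
    """New worker per task: the cache is always cold, so every task pays."""
    transfers, results = 0, []
    for task in tasks:
        worker = Worker()
        results.append(worker.run_task(task, broadcast))
        transfers += worker.transfers
    return results, transfers

def run_with_reuse(tasks, broadcast):
    """One long-lived worker: the broadcast is transferred exactly once."""
    worker = Worker()
    results = [worker.run_task(task, broadcast) for task in tasks]
    return results, worker.transfers
```

With N tasks, the non-reusing path performs N transfers and the reusing path performs one, which is the performance gap the issue (and SPARK-23650) points at.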
[jira] [Closed] (SPARK-24403) reuse r worker
[ https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung closed SPARK-24403. > reuse r worker > -- > > Key: SPARK-24403 > URL: https://issues.apache.org/jira/browse/SPARK-24403 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Deepansh >Priority: Major > Labels: sparkR > > Currently, SparkR doesn't support reuse of its workers, so broadcast > variables and closures are transferred to the workers each time. Can we apply > the idea of Python worker reuse to SparkR as well, to enhance its performance? > performance issues reference > [https://issues.apache.org/jira/browse/SPARK-23650]
[jira] [Resolved] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
[ https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16451. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21368 [https://github.com/apache/spark/pull/21368] > Spark-shell / pyspark should finish gracefully when "SaslException: GSS > initiate failed" is hit > --- > > Key: SPARK-16451 > URL: https://issues.apache.org/jira/browse/SPARK-16451 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > Steps to reproduce (secure cluster): > * kdestroy > * spark-shell --master yarn-client > If no valid keytab is set while running spark-shell/pyspark, the Spark client > never exits; it keeps printing the error messages below. > The Spark client should call its shutdown hook immediately and exit with a proper error > code. > Currently, the user has to shut down the process explicitly (using Ctrl+C). > {code} > 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the > server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595) > at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) > at > 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756) > at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617) > at org.apache.hadoop.ipc.Client.call(Client.java:1448) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437) > at > org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316) > at > 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.s
[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
[ https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-16451: Assignee: Marcelo Vanzin > Spark-shell / pyspark should finish gracefully when "SaslException: GSS > initiate failed" is hit > --- > > Key: SPARK-16451 > URL: https://issues.apache.org/jira/browse/SPARK-16451 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora >Assignee: Marcelo Vanzin >Priority: Major > > Steps to reproduce (secure cluster): > * kdestroy > * spark-shell --master yarn-client > If no valid keytab is set while running spark-shell/pyspark, the Spark client > never exits; it keeps printing the error messages below. > The Spark client should call its shutdown hook immediately and exit with a proper error > code. > Currently, the user has to shut down the process explicitly (using Ctrl+C). > {code} > 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the > server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595) > at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756) > at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397) > at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1617) > at org.apache.hadoop.ipc.Client.call(Client.java:1448) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437) > at > org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) > at org.apache.spark.SparkContext.(SparkContext.scala:530) > at > org.apache.spark.repl.SparkILoop.createS
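The fix requested in the SPARK-16451 report above is behavioral rather than algorithmic: bound the retries and exit with a nonzero status instead of looping forever on a fatal authentication failure. A language-agnostic sketch in Python (Spark's actual fix lives in Scala; `connect`, `AuthError`, and the retry count here are hypothetical stand-ins for the Hadoop RPC/SASL handshake):

```python
import time

class AuthError(Exception):
    """Stand-in for javax.security.sasl.SaslException."""

def connect_with_bounded_retries(connect, max_attempts=3, backoff_s=0.0):
    """Try `connect` a bounded number of times; on persistent auth failure,
    give up with a proper exit status instead of forcing the user to Ctrl+C."""
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return connect()
        except AuthError as exc:
            last_error = exc
            time.sleep(backoff_s)
    raise SystemExit("authentication failed after %d attempts: %s"
                     % (max_attempts, last_error))
```

The key design point is that an unrecoverable credential error is surfaced to the caller (here via `SystemExit`) rather than swallowed inside an unbounded retry loop.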
[jira] [Assigned] (SPARK-24215) Implement __repr__ and _repr_html_ for dataframes in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24215: Assignee: Li Yuanjian > Implement __repr__ and _repr_html_ for dataframes in PySpark > > > Key: SPARK-24215 > URL: https://issues.apache.org/jira/browse/SPARK-24215 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.4.0 > > > To help people who are new to Spark get feedback more easily, we should > implement the repr methods for Jupyter Python kernels. That way, when users > run pyspark in the Jupyter console or in notebooks, they get good feedback about the > queries they've defined. > This should include an option for eager evaluation (maybe > spark.jupyter.eager-eval?). When set, the formatting methods would run the > dataframes and produce output like {{show}}. This is a good balance between > not hiding Spark's action behavior and giving feedback to users who don't > know to call actions. > Here's the dev list thread for context: > http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html
[jira] [Resolved] (SPARK-24215) Implement __repr__ and _repr_html_ for dataframes in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24215. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21370 [https://github.com/apache/spark/pull/21370] > Implement __repr__ and _repr_html_ for dataframes in PySpark > > > Key: SPARK-24215 > URL: https://issues.apache.org/jira/browse/SPARK-24215 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.4.0 > > > To help people who are new to Spark get feedback more easily, we should > implement the repr methods for Jupyter Python kernels. That way, when users > run pyspark in the Jupyter console or in notebooks, they get good feedback about the > queries they've defined. > This should include an option for eager evaluation (maybe > spark.jupyter.eager-eval?). When set, the formatting methods would run the > dataframes and produce output like {{show}}. This is a good balance between > not hiding Spark's action behavior and giving feedback to users who don't > know to call actions. > Here's the dev list thread for context: > http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html
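For readers unfamiliar with the mechanism behind SPARK-24215: Jupyter's rich display machinery prefers `_repr_html_` when an object defines it and falls back to `__repr__` otherwise, which is what makes this proposal possible. A minimal, Spark-free sketch (the class name, fields, and `eager` switch are illustrative, not Spark's actual implementation):

```python
class EagerFrame:
    """Toy DataFrame-like object showing the two repr hooks Jupyter uses."""

    def __init__(self, columns, rows, eager=True):
        self.columns = columns
        self.rows = rows
        self.eager = eager  # mimics the proposed eager-evaluation option

    def __repr__(self):
        # Plain-text fallback, used by the console and by Jupyter when
        # _repr_html_ returns None.
        if not self.eager:
            return "EagerFrame[" + ", ".join(self.columns) + "]"
        header = " | ".join(self.columns)
        body = "\n".join(" | ".join(str(v) for v in row) for row in self.rows)
        return header + "\n" + body

    def _repr_html_(self):
        # Rich HTML repr, picked up automatically by Jupyter notebooks.
        if not self.eager:
            return None
        head = "".join("<th>%s</th>" % c for c in self.columns)
        rows = "".join(
            "<tr>" + "".join("<td>%s</td>" % v for v in row) + "</tr>"
            for row in self.rows)
        return "<table><tr>%s</tr>%s</table>" % (head, rows)
```

When `eager` is off, `_repr_html_` returns `None` and the short plain repr is shown, mirroring the balance the ticket describes between eager feedback and not hiding action semantics.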
[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24465: -- Description: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs; see [SPARK-12878]. This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs; see > [SPARK-12878]. > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed.
[jira] [Created] (SPARK-24465) LSHModel should support Structured Streaming for transform
Joseph K. Bradley created SPARK-24465: - Summary: LSHModel should support Structured Streaming for transform Key: SPARK-24465 URL: https://issues.apache.org/jira/browse/SPARK-24465 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.0 Environment: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs; see [SPARK-12878]. This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. Reporter: Joseph K. Bradley
[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24465: -- Environment: (was: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs; see [SPARK-12878]. This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed.) > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > 
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501049#comment-16501049 ] Apache Spark commented on SPARK-24375: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/21494 > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark.
[jira] [Assigned] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24375: Assignee: Jiang Xingbo (was: Apache Spark) > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark.
[jira] [Assigned] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24375: Assignee: Apache Spark (was: Jiang Xingbo) > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark.
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501037#comment-16501037 ] Shivaram Venkataraman commented on SPARK-24359: --- Having a separate repo makes it much cleaner to tag SparkML releases and test that they work with the existing Spark releases, say by tagging them as 2.4.0.1, 2.4.0.2, etc. for every small change that needs to be made on the R side. If they are in the same repo, then the tag will apply to all other Spark changes at that point, making it harder to separate out just the R changes that went into this tag. Also, this separate repo does not need to be permanent. If we find that the package is stable on CRAN, then we can move it back into the main repo. I just think for the first few releases on CRAN it'll be much easier if it's not tied to Spark releases. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. 
This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > The SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also, we propose a code-gen approach for this > package to minimize the work needed to expose future MLlib APIs, but sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need a more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session, R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to the SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing the existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in the API, we will choose based on the following > list of priorities. The API choice that addresses a higher-priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with the rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implemented in JVM/Scala. > * *Being natural to R users:* The ultimate users of this package are R users, and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). If a constructor gets arguments, they will be named > arguments. For example: > {code:java} > > lr <- set_reg_param(set
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501017#comment-16501017 ] Apache Spark commented on SPARK-15784: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/21493 > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501020#comment-16501020 ] Weichen Xu commented on SPARK-15784: [~wm624] Thanks for your enthusiasm, but we need this to be done ASAP, so I created a PR. > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Resolved] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24300. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21492 [https://github.com/apache/spark/pull/21492] > generateLDAData in ml.cluster.LDASuite didn't set seed correctly > > > Key: SPARK-24300 > URL: https://issues.apache.org/jira/browse/SPARK-24300 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Lu Wang >Priority: Minor > Fix For: 2.4.0 > > > [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala] > > generateLDAData uses the same RNG in all partitions to generate random data. > This either causes duplicate rows in cluster mode or nondeterministic behavior > in local mode: > {code:java} > scala> val rng = new java.util.Random(10) > rng: java.util.Random = java.util.Random@78c5ef58 > scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) > }.collect().mkString("\n") > res12: String = > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code} > We should create one RNG per partition to make it safe. > > cc: [~lu.DB] [~josephkb]
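The SPARK-24300 failure mode is easy to reproduce outside Spark. A minimal Python sketch of both the shared-RNG bug and the one-RNG-per-partition fix (function names and the seed-derivation formula are illustrative; in Spark the fix is typically applied via mapPartitionsWithIndex):

```python
import random

def generate_rows_shared_rng(seed, num_partitions, rows_per_partition):
    """Buggy pattern: every partition starts from an RNG in the same state
    (as happens when a driver-side RNG is serialized into the closure that
    is shipped to each executor), so all partitions emit the same rows."""
    data = []
    for _pid in range(num_partitions):
        rng = random.Random(seed)  # identical state in every "partition"
        data.append([rng.randint(0, 9) for _ in range(rows_per_partition)])
    return data

def generate_rows_per_partition_rng(seed, num_partitions, rows_per_partition):
    """Suggested fix: derive one RNG per partition from the partition
    index, keeping the data deterministic but distinct per partition."""
    data = []
    for pid in range(num_partitions):
        rng = random.Random(seed * 31 + pid)  # distinct yet reproducible
        data.append([rng.randint(0, 9) for _ in range(rows_per_partition)])
    return data
```

The per-partition variant preserves the property unit tests care about (same seed, same data) while eliminating the duplicate rows the ticket observed.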
[jira] [Created] (SPARK-24464) Unit tests for MLlib's Instrumentation
Xiangrui Meng created SPARK-24464: - Summary: Unit tests for MLlib's Instrumentation Key: SPARK-24464 URL: https://issues.apache.org/jira/browse/SPARK-24464 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Xiangrui Meng We added Instrumentation to MLlib to log params and metrics during machine learning training and inference. However, the code has zero test coverage, which usually means bugs and regressions in the future. I created this JIRA to discuss how we should test Instrumentation. cc: [~thunterdb] [~josephkb] [~lu.DB]
[jira] [Created] (SPARK-24463) Add catalyst rule to reorder TypedFilters separated by Filters to reduce serde operations
Lior Regev created SPARK-24463: -- Summary: Add catalyst rule to reorder TypedFilters separated by Filters to reduce serde operations Key: SPARK-24463 URL: https://issues.apache.org/jira/browse/SPARK-24463 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Lior Regev
[jira] [Commented] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
[ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500949#comment-16500949 ] Jiachen Yang commented on SPARK-19860: -- May I know why this happens? I came across the same problem. > DataFrame join get conflict error if two frames has a same name column. > --- > > Key: SPARK-19860 > URL: https://issues.apache.org/jira/browse/SPARK-19860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: wuchang >Priority: Major > > {code} > >>> print df1.collect() > [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', > in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), > Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', > in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), > Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', > in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), > Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', > in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), > Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', > in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), > Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', > in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), > Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', > in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), > Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', > in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), > Row(fdate=u'20170301', in_amount1=10159653)] > >>> print df2.collect() > [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', > in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), > Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', > in_amount2=9674888), 
Row(fdate=u'20170201', in_amount2=3227502), > Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', > in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), > Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', > in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), > Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', > in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), > Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', > in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), > Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', > in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), > Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', > in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), > Row(fdate=u'20170301', in_amount2=9475418)] > >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') > 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially > true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
> Traceback (most recent call last): > File "", line 1, in > File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join > jdf = self._jdf.join(other._jdf, on._jc, how) > File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.AnalysisException: u" > Failure when resolving conflicting references in Join: > 'Join Inner, (fdate#42 = fdate#42) > :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) > as int) AS in_amount1#97] > : +- Filter (inorout#44 = A) > : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, > fdate#42] > :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && > (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) > : +- SubqueryAlias history_transfer_v > : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, > fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, > bankwaterid#48, waterid#49, waterstate#50, source#51] > : +- SubqueryAlias history_transfer > :+- > Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] > parquet > +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) > as int) AS in_amount2#145] >+- Filter (inorout#44 = B) > +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, > fdate#42] > +-
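For reference, the "trivially true equals predicate" warning above arises because `df1.fdate` and `df2.fdate` resolve to the same Catalyst attribute id (`fdate#42` on both sides of the join condition), which is why aliasing one of the frames (e.g. `df2.alias(...)`) is the usual workaround. A toy pure-Python model of that mechanism (the `Attr` class and `is_trivially_true` helper are illustrative inventions, not Spark internals):

```python
import itertools

# Toy model of attribute resolution: every column gets a unique expression id,
# but df1.fdate and df2.fdate can resolve to the *same* id when both frames
# derive from one source.
_ids = itertools.count(42)

class Attr:
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_ids)

def is_trivially_true(left, right):
    # The warning fires when both sides of the equality are literally the
    # same attribute, as in 'fdate#42 = fdate#42' in the plan above.
    return left.expr_id == right.expr_id

fdate = Attr("fdate")        # shared by both frames in the failing join
assert is_trivially_true(fdate, fdate)

aliased = Attr("fdate")      # aliasing assigns a fresh expression id
assert not is_trivially_true(fdate, aliased)
```

In this simplified model, aliasing gives the second frame's column a distinct expression id, so the join predicate compares two different attributes and resolution no longer conflicts.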
[jira] [Resolved] (SPARK-24290) Instrumentation Improvement: add logNamedValue taking Array types
[ https://issues.apache.org/jira/browse/SPARK-24290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24290. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21347 [https://github.com/apache/spark/pull/21347] > Instrumentation Improvement: add logNamedValue taking Array types > - > > Key: SPARK-24290 > URL: https://issues.apache.org/jira/browse/SPARK-24290 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24290) Instrumentation Improvement: add logNamedValue taking Array types
[ https://issues.apache.org/jira/browse/SPARK-24290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24290: - Assignee: Lu Wang > Instrumentation Improvement: add logNamedValue taking Array types > - > > Key: SPARK-24290 > URL: https://issues.apache.org/jira/browse/SPARK-24290 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500909#comment-16500909 ] Apache Spark commented on SPARK-24300: -- User 'ludatabricks' has created a pull request for this issue: https://github.com/apache/spark/pull/21492 > generateLDAData in ml.cluster.LDASuite didn't set seed correctly > > > Key: SPARK-24300 > URL: https://issues.apache.org/jira/browse/SPARK-24300 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Lu Wang >Priority: Minor > > [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala] > > generateLDAData uses the same RNG in all partitions to generate random data. > This either causes duplicate rows in cluster mode or indeterministic behavior > in local mode: > {code:java} > scala> val rng = new java.util.Random(10) > rng: java.util.Random = java.util.Random@78c5ef58 > scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) > }.collect().mkString("\n") > res12: String = > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code} > We should create one RNG per partition to make it safe. > > cc: [~lu.DB] [~josephkb] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
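The per-partition RNG fix suggested above can be sketched without Spark: derive each partition's seed from a base seed plus the partition index, so the data stays reproducible while partitions no longer emit identical rows. This is a plain-Python sketch (names are illustrative), not the actual LDASuite code:

```python
import random

def generate_partition(partition_index, rows_per_partition, base_seed=10):
    # One RNG per partition, seeded by base seed + partition index:
    # reproducible across runs, but partitions no longer mirror each other
    # the way a single driver-side RNG serialized into every task does.
    rng = random.Random(base_seed + partition_index)
    return [[rng.randint(0, 9) for _ in range(10)]
            for _ in range(rows_per_partition)]

# Four "partitions" of two rows each, generated deterministically.
data = [row for p in range(4) for row in generate_partition(p, 2)]
assert data == [row for p in range(4) for row in generate_partition(p, 2)]
```

In Spark this pattern would live inside `mapPartitionsWithIndex`, so each task constructs its own generator instead of sharing one captured from the driver.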
[jira] [Assigned] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24300: Assignee: Lu Wang (was: Apache Spark) > generateLDAData in ml.cluster.LDASuite didn't set seed correctly > > > Key: SPARK-24300 > URL: https://issues.apache.org/jira/browse/SPARK-24300 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Lu Wang >Priority: Minor > > [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala] > > generateLDAData uses the same RNG in all partitions to generate random data. > This either causes duplicate rows in cluster mode or indeterministic behavior > in local mode: > {code:java} > scala> val rng = new java.util.Random(10) > rng: java.util.Random = java.util.Random@78c5ef58 > scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) > }.collect().mkString("\n") > res12: String = > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code} > We should create one RNG per partition to make it safe. > > cc: [~lu.DB] [~josephkb] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24300: Assignee: Apache Spark (was: Lu Wang) > generateLDAData in ml.cluster.LDASuite didn't set seed correctly > > > Key: SPARK-24300 > URL: https://issues.apache.org/jira/browse/SPARK-24300 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Minor > > [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala] > > generateLDAData uses the same RNG in all partitions to generate random data. > This either causes duplicate rows in cluster mode or indeterministic behavior > in local mode: > {code:java} > scala> val rng = new java.util.Random(10) > rng: java.util.Random = java.util.Random@78c5ef58 > scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) > }.collect().mkString("\n") > res12: String = > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4) > List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code} > We should create one RNG per partition to make it safe. > > cc: [~lu.DB] [~josephkb] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500900#comment-16500900 ] Xiangrui Meng commented on SPARK-24374: --- [~leftnoteasy] Thanks for your input! {quote} This JIRA is trying to solve the gang scheduling problem of ML applications; however, gang scheduling should be handled by the underlying resource scheduler instead of Spark, because Spark with a non-standalone deployment has no control over how resource allocation is done. {quote} Agree that the overall resource provision should be handled by the resource manager. I think this is true for standalone as well. However, inside Spark, we still need fine-grained job scheduling to allocate tasks. For example, a skewed join might hold some task slots for quite a long time, and hence the tasks from the next stage have to wait to start all together. Ideally, Spark should be able to talk to the resource manager for better elasticity. {quote} If the proposed API is meant to implement gang scheduling using a gather-and-hold pattern, the existing Spark API should be good enough – just request resources until the target number of containers is reached. The application needs to wait in both cases. {quote} The Spark API is not good enough for two reasons: 1) The all-reduce pattern could be implemented by a single gather and broadcast, but the driver that gathers the messages would become the bottleneck when the message is big (~20 million features) or there are too many nodes. This is why we started with SPARK-1485 (all-reduce) but ended up at SPARK-2174 (tree-reduce). 2) If we ask the user program to set a barrier and wait for all tasks to be ready, it won't work in case of failures, because Spark will only retry the failed task instead of all of them. This requires significant code changes on the user's side to handle the failure scenario. {quote} MPI needs launched processes to contact their master so the master can launch slaves and make them interconnect with each other (phone-home). 
Applications need to implement logic to talk to different RMs. {quote} [~jiangxb1987] made a prototype of this scenario, not on YARN but on standalone, to help discuss the design. In the barrier stage, users can easily get the node info of all tasks, so the MPI setup could be simplified. We will take a look at the mpich2-yarn code base. Is there a design doc there, so we can quickly get the high-level design choices? {quote} One potential benefit I can think of for embedding the app in Spark is that applications could directly read from the memory of Spark tasks. {quote} I'm preparing a SPIP doc for accelerating the data exchange between Spark and 3rd-party frameworks. I would treat it as an orthogonal issue here. Fast data exchange would help the model inference use case a lot. For training, we might just need a standard data interface to simplify data conversions. > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. 
To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
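The barrier model described in the proposal — all tasks launch together, learn the full membership of the stage, and only then proceed — can be illustrated with a plain thread barrier. This is a toy sketch of the semantics (hypothetical names, threads standing in for tasks), not Spark's scheduler:

```python
import threading

def run_barrier_stage(num_tasks):
    """Toy gang-scheduled stage: every task starts, registers its address,
    and blocks at a barrier until all peers have registered."""
    barrier = threading.Barrier(num_tasks)
    addresses = [None] * num_tasks  # stand-in for per-task host info
    results = [None] * num_tasks

    def task(i):
        addresses[i] = f"host-{i}"  # each task announces itself
        barrier.wait()              # nobody proceeds until all tasks arrive
        # Past the barrier, every task sees the full membership list --
        # the information an MPI-style phone-home setup needs.
        results[i] = sorted(addresses)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The fault-tolerance point in the quote maps onto this sketch as well: if one "task" dies before reaching the barrier, the whole stage must be restarted, since the survivors would otherwise wait forever.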
[jira] [Resolved] (SPARK-21896) Stack Overflow when window function nested inside aggregate function
[ https://issues.apache.org/jira/browse/SPARK-21896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21896. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21473 [https://github.com/apache/spark/pull/21473] > Stack Overflow when window function nested inside aggregate function > > > Key: SPARK-21896 > URL: https://issues.apache.org/jira/browse/SPARK-21896 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Luyao Yang >Assignee: Anton Okolnychyi >Priority: Minor > Fix For: 2.4.0 > > > A minimal example: with the following simple test data > {noformat} > >>> df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ['a', 'b']) > >>> df.show() > +---+---+ > | a| b| > +---+---+ > | 1| 2| > | 1| 3| > | 2| 4| > +---+---+ > {noformat} > This works: > {noformat} > >>> w = Window().orderBy('b') > >>> result = (df.select(F.rank().over(w).alias('rk')) > ....groupby() > ....agg(F.max('rk')) > ... ) > >>> result.show() > +---+ > |max(rk)| > +---+ > | 3| > +---+ > {noformat} > But this equivalent gives an error. Note that the error is thrown right when > the operation is defined, not when an action is called later: > {noformat} > >>> result = (df.groupby() > ....agg(F.max(F.rank().over(w))) > ... 
) > Traceback (most recent call last): > File > "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", > line 2885, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 2, in > .agg(F.max(F.rank().over(w))) > File "/usr/lib/spark/python/pyspark/sql/group.py", line 91, in agg > _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]])) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line > 319, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o10789.agg. > : java.lang.StackOverflowError > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:55) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:400) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1688) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1724) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1687) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1825) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1800) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.map
[jira] [Assigned] (SPARK-21896) Stack Overflow when window function nested inside aggregate function
[ https://issues.apache.org/jira/browse/SPARK-21896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21896: --- Assignee: Anton Okolnychyi > Stack Overflow when window function nested inside aggregate function > > > Key: SPARK-21896 > URL: https://issues.apache.org/jira/browse/SPARK-21896 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Luyao Yang >Assignee: Anton Okolnychyi >Priority: Minor > Fix For: 2.4.0 > > > A minimal example: with the following simple test data > {noformat} > >>> df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ['a', 'b']) > >>> df.show() > +---+---+ > | a| b| > +---+---+ > | 1| 2| > | 1| 3| > | 2| 4| > +---+---+ > {noformat} > This works: > {noformat} > >>> w = Window().orderBy('b') > >>> result = (df.select(F.rank().over(w).alias('rk')) > ....groupby() > ....agg(F.max('rk')) > ... ) > >>> result.show() > +---+ > |max(rk)| > +---+ > | 3| > +---+ > {noformat} > But this equivalent gives an error. Note that the error is thrown right when > the operation is defined, not when an action is called later: > {noformat} > >>> result = (df.groupby() > ....agg(F.max(F.rank().over(w))) > ... 
) > Traceback (most recent call last): > File > "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", > line 2885, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 2, in > .agg(F.max(F.rank().over(w))) > File "/usr/lib/spark/python/pyspark/sql/group.py", line 91, in agg > _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]])) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line > 319, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o10789.agg. > : java.lang.StackOverflowError > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:55) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:400) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1688) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1724) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1687) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1825) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1800) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.tr
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500809#comment-16500809 ] Xiangrui Meng commented on SPARK-15784: --- Discussed with [~WeichenXu123] offline. I think we should change the APIs to the following: {code} class PowerIterationClustering extends Params with HasWeightCol with DefaultReadWrite { def srcCol: Param[String] def dstCol: Param[String] def weightCol: Param[String] def assignClusters(dataset: Dataset[_]): DataFrame[id: Long, cluster: Int] } {code} > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-24448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500794#comment-16500794 ] Pritpal Singh commented on SPARK-24448: --- [~jerryshao], client mode does not work either. We use a standalone cluster. Is it an issue with standalone? > File not found on the address SparkFiles.get returns on standalone cluster > -- > > Key: SPARK-24448 > URL: https://issues.apache.org/jira/browse/SPARK-24448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Pritpal Singh >Priority: Major > > I want to upload a file on all worker nodes in a standalone cluster and > retrieve the location of the file. Here is my code: > > val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks" > val file = new File(tempKeyStoreLoc) > sparkContext.addFile(file.getAbsolutePath) > val keyLoc = SparkFiles.get("keystore.jks") > > SparkFiles.get returns a random location where keystore.jks does not exist. I > submit the job in cluster mode. In fact, the location SparkFiles.get returns does > not exist on any of the worker nodes (including the driver node). > I observed that Spark does load keystore.jks files on worker nodes at > /work///keystore.jks. The partition_id > changes from one worker node to another. > My requirement is to upload a file on all nodes of a cluster and retrieve its > location. I'm expecting the location to be common across all worker nodes. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24453) Fix error recovering from the failure in a no-data batch
[ https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24453: Assignee: Apache Spark (was: Tathagata Das) > Fix error recovering from the failure in a no-data batch > > > Key: SPARK-24453 > URL: https://issues.apache.org/jira/browse/SPARK-24453 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Major > > ``` > java.lang.AssertionError: assertion failed: Concurrent update to the log. > Multiple streaming jobs detected for 159897 > ``` > The error occurs when we are recovering from a failure in a no-data batch > (say X) that has been planned (i.e. written to offset log) but not executed > (i.e. not written to commit log). Upon recovery, the following sequence of > events happen. > - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. > Since there was no data in the batch, the `availableOffsets` is same as > `committedOffsets`, so `isNewDataAvailable` is false. > - When MicroBatchExecution.constructNextBatch is called, ideally it should > immediately return true because the next batch has already been constructed. > However, the check of whether the batch has been constructed was `if > (isNewDataAvailable) return true`. Since the planned batch is a no-data > batch, it escaped this check and proceeded to plan the same batch X once > again. And if there is new data since the failure, it does plan a new batch, > and try to write new offsets to the `offsetLog` as batchId X, and fail with > the above error. > The correct solution is to check the offset log whether the currentBatchId is > the latest or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24453) Fix error recovering from the failure in a no-data batch
[ https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24453: Assignee: Tathagata Das (was: Apache Spark) > Fix error recovering from the failure in a no-data batch > > > Key: SPARK-24453 > URL: https://issues.apache.org/jira/browse/SPARK-24453 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > ``` > java.lang.AssertionError: assertion failed: Concurrent update to the log. > Multiple streaming jobs detected for 159897 > ``` > The error occurs when we are recovering from a failure in a no-data batch > (say X) that has been planned (i.e. written to offset log) but not executed > (i.e. not written to commit log). Upon recovery, the following sequence of > events happen. > - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. > Since there was no data in the batch, the `availableOffsets` is same as > `committedOffsets`, so `isNewDataAvailable` is false. > - When MicroBatchExecution.constructNextBatch is called, ideally it should > immediately return true because the next batch has already been constructed. > However, the check of whether the batch has been constructed was `if > (isNewDataAvailable) return true`. Since the planned batch is a no-data > batch, it escaped this check and proceeded to plan the same batch X once > again. And if there is new data since the failure, it does plan a new batch, > and try to write new offsets to the `offsetLog` as batchId X, and fail with > the above error. > The correct solution is to check the offset log whether the currentBatchId is > the latest or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24453) Fix error recovering from the failure in a no-data batch
[ https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500770#comment-16500770 ] Apache Spark commented on SPARK-24453: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/21491 > Fix error recovering from the failure in a no-data batch > > > Key: SPARK-24453 > URL: https://issues.apache.org/jira/browse/SPARK-24453 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > ``` > java.lang.AssertionError: assertion failed: Concurrent update to the log. > Multiple streaming jobs detected for 159897 > ``` > The error occurs when we are recovering from a failure in a no-data batch > (say X) that has been planned (i.e. written to offset log) but not executed > (i.e. not written to commit log). Upon recovery, the following sequence of > events happen. > - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. > Since there was no data in the batch, the `availableOffsets` is same as > `committedOffsets`, so `isNewDataAvailable` is false. > - When MicroBatchExecution.constructNextBatch is called, ideally it should > immediately return true because the next batch has already been constructed. > However, the check of whether the batch has been constructed was `if > (isNewDataAvailable) return true`. Since the planned batch is a no-data > batch, it escaped this check and proceeded to plan the same batch X once > again. And if there is new data since the failure, it does plan a new batch, > and try to write new offsets to the `offsetLog` as batchId X, and fail with > the above error. > The correct solution is to check the offset log whether the currentBatchId is > the latest or not. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
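The corrected check described above (consult the offset log rather than `isNewDataAvailable`) can be sketched as follows; the function and variable names are hypothetical, and this is a simplified model rather than Spark's `MicroBatchExecution` code:

```python
def batch_already_constructed(current_batch_id, offset_log):
    # A batch counts as constructed iff its offsets are already in the
    # offset log -- independent of whether any new data is available.
    latest_batch_id = max(offset_log) if offset_log else -1
    return current_batch_id <= latest_batch_id

# Recovery scenario from the description: no-data batch 5 was planned
# (offsets written) but never committed.
offset_log = {3, 4, 5}
commit_log = {3, 4}
assert batch_already_constructed(5, offset_log)      # re-execute batch 5, don't re-plan
assert not batch_already_constructed(6, offset_log)  # next batch still needs planning
```

With this check, recovery re-executes the already planned batch instead of planning it a second time, which is what previously produced the "Concurrent update to the log" assertion failure.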
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500639#comment-16500639 ] Reynold Xin commented on SPARK-24359: - Why would a separate repo lead to faster iteration? What's the difference between that and just a directory in the mainline repo that's not part of the same build as the mainline repo? > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > The SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. 
Also, we propose a code-gen approach for this > package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s > API is manually written. > h1. Target Personas > * Existing SparkR users who need a more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently from SparkR > * After setting up a Spark session, R users can > ** create a pipeline by chaining individual components and specifying their > parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to the SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing the existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in the API, we will choose based on the following > list of priorities. The API choice that addresses a higher-priority goal will > be chosen. > # *Comprehensive coverage of the MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity:* We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with the rest of Spark's components:* We will keep the R API > as thin as possible and keep all functionality implemented in JVM/Scala. > * *Being natural to R users:* The ultimate users of this package are R users, and > they should find it easy and natural to use. 
> The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). If a constructor takes arguments, they will be named > arguments. For example: > {code:java} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} > When calls need to be chained, as in the above example, the syntax translates nicely > to a natural pipeline style with help from the very popular [magrittr > package|https://cran.r-project.org/web/packages/magrittr/index.html]. For > example: > {code:java} > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} > h2. Namespace > All new API will be under a new CRAN package, named SparkML. The package > should be usable without
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500634#comment-16500634 ] Felix Cheung commented on SPARK-24359: -- +1 on the `spark-website` model for faster iterations. This was my suggestion originally, not just for releases but also for publishing on CRAN. But if you can get the package source into a state where it "works" without Spark (JVM) and SparkR, that will make the publication process easier. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf >
[jira] [Assigned] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state
[ https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24462: Assignee: Apache Spark > Text socket micro-batch reader throws error when a query is restarted with > saved state > -- > > Key: SPARK-24462 > URL: https://issues.apache.org/jira/browse/SPARK-24462 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Critical > > Exception thrown: > > {noformat} > scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = > 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = > f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error > java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377) > > {noformat} > > Sample code that reproduces the error on restarting the query. 
> {code:java}
> import java.sql.Timestamp
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import spark.implicits._
> import org.apache.spark.sql.streaming.Trigger
> val lines = spark.readStream.format("socket").option("host", "localhost").option("port", ).option("includeTimestamp", true).load()
> val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" ").map(word => (word, line._2))).toDF("word", "timestamp")
> val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 minutes"), $"word").count().orderBy("window")
> val query = windowedCounts.writeStream.outputMode("complete").option("checkpointLocation", "/tmp/debug").format("console").option("truncate", "false").start()
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
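The failure mode above can be sketched outside Spark. The socket source keeps its committed offset only in memory, so after a restart its notion of "last committed" no longer matches what the checkpoint replays, and the commit-ordering guard fires. A minimal Python sketch of the same guard shape (illustrative only; `Reader` is not Spark's actual `TextSocketMicroBatchReader`):

```python
class Reader:
    """Toy source whose committed offset lives only in memory."""
    def __init__(self):
        self.last_committed = -1

    def commit(self, end):
        # Monotonicity guard: committing an offset below the last committed
        # one raises, mirroring the check in socket.scala's commit().
        if end < self.last_committed:
            raise RuntimeError(
                f"Offsets committed out of order: {self.last_committed} followed by {end}")
        self.last_committed = end

reader = Reader()
reader.commit(2)        # the engine commits the batch ending at offset 2

# On restart, replayed state and the source's in-memory counter disagree,
# so a commit below 2 trips the guard with the message from the report.
try:
    reader.commit(-1)
except RuntimeError as e:
    error_message = str(e)
print(error_message)
```

The fix direction would be to make the source tolerate (or recover) committed offsets across restarts instead of assuming a fresh in-memory counter.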
[jira] [Assigned] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state
[ https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24462: Assignee: (was: Apache Spark) > Text socket micro-batch reader throws error when a query is restarted with > saved state > -- > > Key: SPARK-24462 > URL: https://issues.apache.org/jira/browse/SPARK-24462 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Arun Mahadevan >Priority: Critical -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state
[ https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500606#comment-16500606 ] Apache Spark commented on SPARK-24462: -- User 'arunmahadevan' has created a pull request for this issue: https://github.com/apache/spark/pull/21490 > Text socket micro-batch reader throws error when a query is restarted with > saved state > -- > > Key: SPARK-24462 > URL: https://issues.apache.org/jira/browse/SPARK-24462 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Arun Mahadevan >Priority: Critical -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state
Arun Mahadevan created SPARK-24462: -- Summary: Text socket micro-batch reader throws error when a query is restarted with saved state Key: SPARK-24462 URL: https://issues.apache.org/jira/browse/SPARK-24462 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Arun Mahadevan Exception thrown: {noformat} scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1 at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377) {noformat} Sample code that reproduces the error on restarting the query. {code:java} import java.sql.Timestamp import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ import spark.implicits._ import org.apache.spark.sql.streaming.Trigger val lines = spark.readStream.format("socket").option("host", "localhost").option("port", ).option("includeTimestamp", true).load() val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" ").map(word => (word, line._2))).toDF("word", "timestamp") val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 minutes"), $"word").count().orderBy("window") val query = windowedCounts.writeStream.outputMode("complete").option("checkpointLocation", "/tmp/debug").format("console").option("truncate", "false").start() {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500560#comment-16500560 ] Steve Loughran commented on SPARK-20202: I think you could split things in two: # a modified Hive 1.2.1.x for Hadoop 3, with a new package name in the Maven builds (joy, profiles!) and some work with the Hive team to get this officially published by them. Strength: easy for people to backport into shipping 2.2 and 2.3 builds just by changing the POM # the bigger move to Hive 2. This will be the best for the future, but is bound to have more surprises. There's even the possibility that the Hive team might have to make some changes too, which isn't out of the question if the timelines line up. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23903) Add support for date extract
[ https://issues.apache.org/jira/browse/SPARK-23903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23903. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21479 [https://github.com/apache/spark/pull/21479] > Add support for date extract > > > Key: SPARK-23903 > URL: https://issues.apache.org/jira/browse/SPARK-23903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Heavily used in timeseries based datasets. > https://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT > {noformat} > EXTRACT(field FROM source) > {noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23903) Add support for date extract
[ https://issues.apache.org/jira/browse/SPARK-23903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23903: - Assignee: Yuming Wang > Add support for date extract > > > Key: SPARK-23903 > URL: https://issues.apache.org/jira/browse/SPARK-23903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > Heavily used in timeseries based datasets. > https://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT > {noformat} > EXTRACT(field FROM source) > {noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24461) Snapshot Cache
[ https://issues.apache.org/jira/browse/SPARK-24461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500539#comment-16500539 ] Xiao Li commented on SPARK-24461: - cc [~maryannxue] > Snapshot Cache > -- > > Key: SPARK-24461 > URL: https://issues.apache.org/jira/browse/SPARK-24461 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some usage scenarios, data staleness is not critical. We can introduce a > snapshot cache of the original data to achieve much better performance. > Unlike the current cache, it is resolved by name instead of by plan > matching. Cache rebuilds can be triggered manually or by events. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24461) Snapshot Cache
Xiao Li created SPARK-24461: --- Summary: Snapshot Cache Key: SPARK-24461 URL: https://issues.apache.org/jira/browse/SPARK-24461 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li In some usage scenarios, data staleness is not critical. We can introduce a snapshot cache of the original data to achieve much better performance. Unlike the current cache, it is resolved by name instead of by plan matching. Cache rebuilds can be triggered manually or by events. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
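The distinction the ticket draws between name-based resolution and plan matching can be sketched with a toy cache. This is a hypothetical illustration only — none of these names exist in Spark — showing lookups keyed by table name, with entries rebuilt manually or invalidated by an event:

```python
class SnapshotCache:
    """Toy name-keyed snapshot cache: stale data is acceptable, and
    resolution happens by name rather than by comparing query plans."""
    def __init__(self, loader):
        self.loader = loader      # loader(name) -> materialized snapshot
        self.entries = {}         # name -> snapshot

    def get(self, name):
        # Resolution by name: any query referring to "sales" hits the
        # snapshot directly; no plan comparison is needed.
        if name not in self.entries:
            self.entries[name] = self.loader(name)
        return self.entries[name]

    def rebuild(self, name):
        # Manual rebuild path.
        self.entries.pop(name, None)
        return self.get(name)

    def on_change_event(self, name):
        # Event-driven invalidation: next get() reloads fresh data.
        self.entries.pop(name, None)

loads = []
def load(name):
    loads.append(name)            # track how often the source is scanned
    return f"snapshot:{name}"

cache = SnapshotCache(load)
cache.get("sales")
cache.get("sales")                # possibly stale snapshot served, no reload
assert loads == ["sales"]
cache.on_change_event("sales")    # upstream data changed
cache.get("sales")                # rebuilt on next access
```

The trade-off mirrored here is exactly the one in the description: readers may see stale data between events, in exchange for skipping both the recomputation and the plan-matching lookup cost.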
[jira] [Created] (SPARK-24460) exactly-once mode
Jose Torres created SPARK-24460: --- Summary: exactly-once mode Key: SPARK-24460 URL: https://issues.apache.org/jira/browse/SPARK-24460 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres Major changes we know will need to be made: * Restart strategy needs to replay offsets already in the log as microbatches. * Some kind of epoch alignment - need to think about this further. We'll also need a test plan to ensure that we've actually achieved exactly once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24459) watermarks
Jose Torres created SPARK-24459: --- Summary: watermarks Key: SPARK-24459 URL: https://issues.apache.org/jira/browse/SPARK-24459 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24210) incorrect handling of boolean expressions when using column in expressions in pyspark.sql.DataFrame filter function
[ https://issues.apache.org/jira/browse/SPARK-24210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500383#comment-16500383 ] Li Yuanjian edited comment on SPARK-24210 at 6/4/18 3:34 PM: - I think it may not be a bug. {code:python} #KO: returns r1 and r3
ex.filter(('c1 = 1') and ('c2 = 1')).show() {code} This is caused by Python's own `and` operator on strings. After passing to df.filter, there's only 'c2 = 1'. {code:python} #KO: returns r0 and r3
ex.filter('c1 = 1 & c2 = 1').show()
#KO: returns r0 and r3
ex.filter('c1 == 1 & c2 == 1').show() {code} As you mentioned, [https://github.com/apache/spark/pull/6961] actually fixes '&' between columns, but not string expressions like 'c1 = 1 & c2 = 1'. Here, in ex.filter('c1 = 1 & c2 = 1'), Spark parses it to a valueExpression like 'Filter (('a = (1 & 'b)) = 1), so I think this makes sense here. was (Author: xuanyuan): I think it maybe not a bug. #KO: returns r1 and r3ex.filter(('c1 = 1') and ('c2 = 1')).show() This cause by python self base string __and__ implementation. After passing to df.filter, there's only 'c2 = 1'. #KO: returns r0 and r3ex.filter('c1 = 1 & c2 = 1').show()#KO: returns r0 and r3ex.filter('c1 == 1 & c2 == 1').show() As you mentioned, [https://github.com/apache/spark/pull/6961] actually fix the '&' between column, but not string expression like 'c1 = 1 & c2 = 1', here in ex.filter('c1 = 1 & c2 = 1'), Spark parse it to valueExpression like: 'Filter (('a = (1 & 'b)) = 1), I think this make sense here. 
> incorrect handling of boolean expressions when using column in expressions in > pyspark.sql.DataFrame filter function > --- > > Key: SPARK-24210 > URL: https://issues.apache.org/jira/browse/SPARK-24210 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.2 >Reporter: Michael H >Priority: Major > > {code:python} > ex = spark.createDataFrame([ > ('r0', 0, 0), > ('r1', 0, 1), > ('r2', 1, 0), > ('r3', 1, 1)]\ > , "row: string, c1: int, c2: int") > #KO: returns r1 and r3 > ex.filter(('c1 = 1') and ('c2 = 1')).show() > #OK, raises an exception > ex.filter(('c1 == 1') & ('c2 == 1')).show() > #KO: returns r0 and r3 > ex.filter('c1 = 1 & c2 = 1').show() > #KO: returns r0 and r3 > ex.filter('c1 == 1 & c2 == 1').show() > #OK: returns r3 only > ex.filter('c1 = 1 and c2 = 1').show() > #OK: returns r3 only > ex.filter('c1 == 1 and c2 == 1').show() > {code} > building the expressions using {code}ex.c1{code} or {code}ex['c1']{code} we > don't have this. > Issue seems related with > https://github.com/apache/spark/pull/6961 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
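The comment's point about Python's `and` is easy to verify without Spark: `and` between two non-empty (truthy) strings evaluates to the second operand, so the first filter condition is silently discarded before Spark ever sees it.

```python
# 'and' between two truthy strings yields the second string, so
# df.filter(('c1 = 1') and ('c2 = 1')) is really just df.filter('c2 = 1').
cond = ('c1 = 1') and ('c2 = 1')
print(cond)  # c2 = 1

# By contrast, '&' inside a single SQL string is passed to Spark's parser,
# where '&' binds tighter than '=', so 'c1 = 1 & c2 = 1' parses roughly as
# c1 = ((1 & c2) = 1) -- matching the surprising rows reported above.
assert cond == 'c2 = 1'
```

This is why only the `ex.c1`/`ex['c1']` Column forms (where `&` builds a proper conjunction expression) and the SQL keyword `and` inside one string behave as intended.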
[jira] [Commented] (SPARK-24210) incorrect handling of boolean expressions when using column in expressions in pyspark.sql.DataFrame filter function
[ https://issues.apache.org/jira/browse/SPARK-24210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500383#comment-16500383 ] Li Yuanjian commented on SPARK-24210: - I think it may not be a bug. #KO: returns r1 and r3
ex.filter(('c1 = 1') and ('c2 = 1')).show() This is caused by Python's own `and` operator on strings. After passing to df.filter, there's only 'c2 = 1'. #KO: returns r0 and r3
ex.filter('c1 = 1 & c2 = 1').show()
#KO: returns r0 and r3
ex.filter('c1 == 1 & c2 == 1').show() As you mentioned, [https://github.com/apache/spark/pull/6961] actually fixes '&' between columns, but not string expressions like 'c1 = 1 & c2 = 1'. Here, in ex.filter('c1 = 1 & c2 = 1'), Spark parses it to a valueExpression like 'Filter (('a = (1 & 'b)) = 1), so I think this makes sense here. > incorrect handling of boolean expressions when using column in expressions in > pyspark.sql.DataFrame filter function > --- > > Key: SPARK-24210 > URL: https://issues.apache.org/jira/browse/SPARK-24210 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.2 >Reporter: Michael H >Priority: Major -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10409) Multilayer perceptron regression
[ https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500278#comment-16500278 ] Angel Conde commented on SPARK-10409: -- Another important use case for the MLP regressor is forecasting machine sensor data. One of our clients uses this approach for predictive maintenance on Industry 4.0 assets. We were hoping to replace their custom implementation, which uses an ad-hoc library, with the Spark ML implementation, but we are blocked until this gets merged. I can't go into too much detail about the use case, but it's in production in industrial environments. The general approach is to predict one sensor value based on the others. > Multilayer perceptron regression > > > Key: SPARK-10409 > URL: https://issues.apache.org/jira/browse/SPARK-10409 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Implement regression based on a multilayer perceptron (MLP). It should support > different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. > The implementation might take advantage of an autoencoder. Time-series > forecasting for financial data might be one of the use cases; see > http://dl.acm.org/citation.cfm?id=561452. So there is a need for more > specific requirements from this (or another) area. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child
Abdeali Kothari created SPARK-24458: --- Summary: Invalid PythonUDF check_1(), requires attributes from more than one child Key: SPARK-24458 URL: https://issues.apache.org/jira/browse/SPARK-24458 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Environment: Spark 2.3.0 (local mode) Mac OSX Reporter: Abdeali Kothari I was trying out a very large query execution plan I have and I got the error: {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o359.simpleString. : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires attributes from more than one child. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181) at scala.collection.immutable.Stream.foreach(Stream.scala:594) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77) at org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) at org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100) at 
org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748){code} I get a dataframe (df) after a lot of P
[jira] [Created] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils
Sharad Sonker created SPARK-24457: - Summary: Performance improvement while converting stringToTimestamp in DateTimeUtils Key: SPARK-24457 URL: https://issues.apache.org/jira/browse/SPARK-24457 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Sharad Sonker stringToTimestamp in DateTimeUtils creates a Calendar instance for each input row, even when the input timezone is the same. This can be improved by caching the Calendar instance for each input timezone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
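The proposed change is essentially memoization keyed by timezone id. A minimal Python sketch of the idea (illustrative only — the real fix would live in Scala's DateTimeUtils, and because java.util.Calendar is mutable, a production cache would additionally need to be per-thread):

```python
_tz_cache = {}  # timezone id -> expensive-to-build per-timezone object

def get_calendar(tz_id, factory):
    """Return a cached per-timezone object, building it at most once per zone."""
    cal = _tz_cache.get(tz_id)
    if cal is None:
        cal = factory(tz_id)      # only pay construction cost on a cache miss
        _tz_cache[tz_id] = cal
    return cal

builds = []
def make_calendar(tz_id):
    # Stand-in for the per-row `Calendar.getInstance(tz)` the ticket
    # wants to avoid; we record each construction to show the savings.
    builds.append(tz_id)
    return {"tz": tz_id}

# Many rows in the same zone should trigger only one construction.
for _ in range(5):
    get_calendar("GMT+08:00", make_calendar)
get_calendar("UTC", make_calendar)
```

With millions of input rows sharing a handful of timezones, the construction cost drops from once per row to once per distinct zone.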