[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-04 Thread Ismael Juma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501351#comment-16501351
 ] 

Ismael Juma commented on SPARK-18057:
-

Yes, it is [~kabhwan]. The major version bump is simply because support for 
Java 7 has been dropped.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-04 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24418:

Description: 
Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
version bump. 

*loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
initialize Spark before the REPL sees any files.

Issue filed in Scala community.
https://github.com/scala/bug/issues/10913

  was:
Scala 2.11.12+ will support JDK9+. However, this is not goin to be a simple 
version bump. 

*loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
initialize the Spark before REPL sees any files.

Issue filed in Scala community.
https://github.com/scala/bug/issues/10913
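For context, a minimal sketch (not Spark's actual source) of the hack described above, assuming the Scala 2.11 ILoop API: the REPL subclasses ILoop and overrides loadFiles() so that Spark is initialized before any user files are processed. The class name and the processLine-based initialization are illustrative only.

{code}
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.ILoop

class SparkLikeILoop extends ILoop {
  // Illustrative: bind a SparkSession into the REPL before anything else runs.
  def initializeSpark(): Unit = {
    processLine(
      """@transient val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()""")
  }

  // In Scala 2.11.0-2.11.11 this hook runs before the REPL sees any files;
  // it was removed in 2.11.12, which is why the upgrade is not a simple bump.
  override def loadFiles(settings: Settings): Unit = {
    initializeSpark()
    super.loadFiles(settings)
  }
}
{code}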


> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913






[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24418:


Assignee: DB Tsai  (was: Apache Spark)

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913






[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24418:


Assignee: Apache Spark  (was: DB Tsai)

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913






[jira] [Commented] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501337#comment-16501337
 ] 

Apache Spark commented on SPARK-24418:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21495

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-04 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501334#comment-16501334
 ] 

Jungtaek Lim commented on SPARK-18057:
--

Is the Kafka 2.0.0 client compatible with Kafka 1.x and 0.10.x brokers? I guess end 
users might hesitate to use the latest version in production, especially when the 
major version has changed. The supported broker version range is the most important 
thing to consider while upgrading.
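For reference, a sketch of how an application reads Kafka through the Structured Streaming source. The source-level API below does not change with a kafka-clients upgrade, so the user-visible question is exactly the broker compatibility range raised above; the broker address and topic name are placeholders.

{code}
// Minimal Structured Streaming read from Kafka (spark-sql-kafka-0-10 source).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}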

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Created] (SPARK-24467) VectorAssemblerEstimator

2018-06-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24467:
-

 Summary: VectorAssemblerEstimator
 Key: SPARK-24467
 URL: https://issues.apache.org/jira/browse/SPARK-24467
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
`VectorSizeHint` instead of making `VectorAssembler` into an Estimator since I 
thought the latter option would break most workflows.  However, I should have 
proposed:
* Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
inputCols.  This Param can be optional.  If not given, then VectorAssembler 
will behave as it does now.  If given, then VectorAssembler can use that info 
instead of figuring out the Vector sizes via metadata or examining Rows in the 
data (though it could do consistency checks).
* Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
produces a VectorAssembler with the vector lengths Param specified.

This will not break existing workflows.  Migrating to VectorAssemblerEstimator 
will be easier than adding VectorSizeHint since it will not require users to 
manually input Vector lengths.

Note: Even with this Estimator, VectorSizeHint might prove useful for other 
things in the future which require vector length metadata, so we could consider 
keeping it rather than deprecating it.
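To make the trade-off above concrete, here is a sketch of the workflow the current API requires: one VectorSizeHint per vector input column, with a size the user has to know up front. Column names and the size are illustrative; the proposal would let VectorAssembler (or a VectorAssemblerEstimator) obtain these lengths itself.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Today: the vector length must be declared explicitly for streaming-safe use.
val sizeHint = new VectorSizeHint()
  .setInputCol("features1")   // illustrative column name
  .setSize(3)                 // length the user has to supply manually
  .setHandleInvalid("skip")

val assembler = new VectorAssembler()
  .setInputCols(Array("features1", "count"))
  .setOutputCol("assembled")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
{code}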






[jira] [Updated] (SPARK-19826) spark.ml Python API for PIC

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-19826:
--
Shepherd: Weichen Xu

> spark.ml Python API for PIC
> ---
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Epic Name: Support Barrier Execution Mode

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
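To make the scheduling-model difference concrete, here is a sketch using the RDD.barrier() API that later shipped in Spark 2.4, shown only as an illustration of the idea quoted above: all tasks of the stage are launched together and can synchronize through BarrierTaskContext.

{code}
import org.apache.spark.BarrierTaskContext

val result = sc
  .parallelize(1 to 4, numSlices = 4)
  .barrier()                      // request gang scheduling for this stage
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    // ... start an MPI/Horovod-style worker here (illustrative) ...
    ctx.barrier()                 // global synchronization point across all tasks
    iter
  }
  .collect()
{code}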






[jira] [Assigned] (SPARK-19826) spark.ml Python API for PIC

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-19826:
-

Assignee: Huaxin Gao

> spark.ml Python API for PIC
> ---
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Resolved] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15784.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21493
[https://github.com/apache/spark/pull/21493]

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.
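For reference, a minimal sketch of the spark.ml API this change introduces, as it appears in Spark 2.4. The edge values are made up and only illustrate the expected (src, dst, weight) input and the assignClusters output.

{code}
import org.apache.spark.ml.clustering.PowerIterationClustering
import spark.implicits._

// Pairwise similarity graph as (src, dst, weight) rows; values are illustrative.
val edges = Seq((0L, 1L, 1.0), (1L, 2L, 1.0), (3L, 4L, 1.0))
  .toDF("src", "dst", "weight")

val pic = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(10)
  .setWeightCol("weight")

val assignments = pic.assignClusters(edges)  // DataFrame with columns "id", "cluster"
{code}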






[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501254#comment-16501254
 ] 

Hossein Falaki edited comment on SPARK-24359 at 6/5/18 3:51 AM:


[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR we can wait a bit until we are 
confident about its compatibility. Many users and distributions distribute 
SparkR from Apache rather than CRAN (one example is Databricks; we build SparkR 
from source), and they would be able to give us feedback while the project is in 
alpha state.


was (Author: falaki):
[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR we can wait a bit until we are 
confident about its compatibility. Many users and distributions, distribute 
SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR 
from source.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
> has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R use

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501254#comment-16501254
 ] 

Hossein Falaki commented on SPARK-24359:


[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR we can wait a bit until we are 
confident about its compatibility. Many users and distributions distribute 
SparkR from Apache rather than CRAN. One example is Databricks: we build SparkR 
from source.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
> has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline style with help from very popular[ magrittr 
> 

[jira] [Updated] (SPARK-21302) history server WebUI show HTTP ERROR 500

2018-06-04 Thread Jason Pan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Pan updated SPARK-21302:
--
Description: 
When navigating to the history server WebUI and checking incomplete applications, it 
shows HTTP 500.

Error logs:

17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
refreshing
 17/07/05 20:17:44 WARN ServletHandler: 
/history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
 java.lang.NullPointerException
 at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
 at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
 at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
 at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
 at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
 at org.spark_project.jetty.server.Server.handle(Server.java:499)
 at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
 at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
 at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
 at java.lang.Thread.run(Thread.java:785)
 17/07/05 20:18:00 WARN ServletHandler: /
 java.lang.NullPointerException
 at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
 at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
 at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
 at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
 at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
 at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
 at org.spark_project.jetty.server.Server.handle(Server.java:499)
 at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
 at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
 at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
 at java.lang.Thread.run(Thread.java:785)
 17/07/05 20:18:17 WARN ServletHandler: /
 java.lang.NullPointerException
 at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
 at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)

 

  was:
When navigate to history server WebUI, and check incomplete applications, show 
http 500

Error logs:

17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
refreshing
17/07/05 20:17:44 WARN ServletHandler: 
/history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
java.lang.NullPointerException
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Se

[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501225#comment-16501225
 ] 

Saisai Shao commented on SPARK-20202:
-

OK, for the first one, I've already started working on it locally. It looks like 
it is not a big change; only some POM changes are needed. I will submit a patch to 
the Hive community.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-04 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501216#comment-16501216
 ] 

Jungtaek Lim commented on SPARK-24466:
--

I'm working on this. Will provide a patch soon.

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits before 
> reading the actual data, so the query also exits with an error.
>  
> The reason is that a temporary reader is launched to read the schema, then closed, 
> and a new reader is opened afterwards. While a reliable socket server should be 
> able to handle this without any issue, nc normally can't handle multiple 
> connections and simply exits when the temporary reader is closed.
>  
> Given that the socket source is expected to be used in examples from the official 
> documentation or in quick experiments, where we tend to simply use netcat, this is 
> better treated as a bug, even though it is really a limitation of netcat.






[jira] [Created] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-04 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-24466:


 Summary: TextSocketMicroBatchReader no longer works with nc utility
 Key: SPARK-24466
 URL: https://issues.apache.org/jira/browse/SPARK-24466
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jungtaek Lim


While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits before 
reading the actual data, so the query also exits with an error.
 
The reason is that a temporary reader is launched to read the schema, then closed, 
and a new reader is opened afterwards. While a reliable socket server should be 
able to handle this without any issue, nc normally can't handle multiple 
connections and simply exits when the temporary reader is closed.
 
Given that the socket source is expected to be used in examples from the official 
documentation or in quick experiments, where we tend to simply use netcat, this is 
better treated as a bug, even though it is really a limitation of netcat.
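A small repro sketch of the behaviour described above (port 9999 is arbitrary): run `nc -lk 9999` in one terminal and start the query below in another; if the source opens a temporary connection just to infer the schema, a plain nc listener may exit before the real data is read.

{code}
// Standard socket source usage from the Structured Streaming docs.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .start()
{code}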






[jira] [Resolved] (SPARK-24403) reuse r worker

2018-06-04 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24403.
--
Resolution: Duplicate

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast variables 
> and closures are transferred to the workers each time. Can we add the idea of 
> Python worker reuse to SparkR as well, to enhance its performance?
> Performance issues reference: 
> [https://issues.apache.org/jira/browse/SPARK-23650]






[jira] [Closed] (SPARK-24403) reuse r worker

2018-06-04 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung closed SPARK-24403.


> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast variables 
> and closures are transferred to the workers each time. Can we add the idea of 
> Python worker reuse to SparkR as well, to enhance its performance?
> Performance issues reference: 
> [https://issues.apache.org/jira/browse/SPARK-23650]






[jira] [Resolved] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2018-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16451.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21368
[https://github.com/apache/spark/pull/21368]

> Spark-shell / pyspark should finish gracefully when "SaslException: GSS 
> initiate failed" is hit
> ---
>
> Key: SPARK-16451
> URL: https://issues.apache.org/jira/browse/SPARK-16451
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> Steps to reproduce: (secure cluster)
> * kdestroy
> * spark-shell --master yarn-client
> If no valid keytab is set while running spark-shell/pyspark, the Spark client 
> never exits. It keeps printing the error messages below. 
> The Spark client should call the shutdown hook immediately and exit with a 
> proper error code.
> Currently, the user needs to explicitly shut down the process (using Ctrl+C).
> {code}
> 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the 
> server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
>   at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
>   at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.s

[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2018-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-16451:


Assignee: Marcelo Vanzin

> Spark-shell / pyspark should finish gracefully when "SaslException: GSS 
> initiate failed" is hit
> ---
>
> Key: SPARK-16451
> URL: https://issues.apache.org/jira/browse/SPARK-16451
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Steps to reproduce: (secure cluster)
> * kdestroy
> * spark-shell --master yarn-client
> If no valid keytab is set while running spark-shell/pyspark, the Spark client 
> never exits. It keeps printing the error messages below. 
> The Spark client should call the shutdown hook immediately and exit with a 
> proper error code.
> Currently, the user needs to explicitly shut down the process (using Ctrl+C).
> {code}
> 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the 
> server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
>   at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
>   at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
>   at 
> org.apache.spark.repl.SparkILoop.createS

[jira] [Assigned] (SPARK-24215) Implement __repr__ and _repr_html_ for dataframes in PySpark

2018-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24215:


Assignee: Li Yuanjian

> Implement __repr__ and _repr_html_ for dataframes in PySpark
> 
>
> Key: SPARK-24215
> URL: https://issues.apache.org/jira/browse/SPARK-24215
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> To help people that are new to Spark get feedback more easily, we should 
> implement the repr methods for Jupyter python kernels. That way, when users 
> run pyspark in jupyter console or notebooks, they get good feedback about the 
> queries they've defined.
> This should include an option for eager evaluation (maybe 
> spark.jupyter.eager-eval?). When set, the formatting methods would run 
> dataframes and produce output like {{show}}. This is a good balance between 
> not hiding Spark's action behavior and getting feedback to users that don't 
> know to call actions.
> Here's the dev list thread for context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html






[jira] [Resolved] (SPARK-24215) Implement __repr__ and _repr_html_ for dataframes in PySpark

2018-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24215.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21370
[https://github.com/apache/spark/pull/21370]

> Implement __repr__ and _repr_html_ for dataframes in PySpark
> 
>
> Key: SPARK-24215
> URL: https://issues.apache.org/jira/browse/SPARK-24215
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> To help people that are new to Spark get feedback more easily, we should 
> implement the repr methods for Jupyter python kernels. That way, when users 
> run pyspark in jupyter console or notebooks, they get good feedback about the 
> queries they've defined.
> This should include an option for eager evaluation (maybe 
> spark.jupyter.eager-eval?). When set, the formatting methods would run 
> dataframes and produce output like {{show}}. This is a good balance between 
> not hiding Spark's action behavior and getting feedback to users that don't 
> know to call actions.
> Here's the dev list thread for context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs; see 
[SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.
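An illustrative shape of the streaming check this task asks for (not the actual MLTest helper): fit an LSH model on a static DataFrame, then apply model.transform to a streaming DataFrame built from a MemoryStream, which is where the nested-UDT limitation of [SPARK-12878] currently surfaces.

{code}
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

val train = Seq(Tuple1(Vectors.dense(1.0, 0.0)), Tuple1(Vectors.dense(0.0, 1.0)))
  .toDF("features")

val model = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setInputCol("features")
  .setOutputCol("hashes")
  .fit(train)

// Streaming side: transform should succeed once nested UDT support is in place.
implicit val sqlCtx = spark.sqlContext
val input = MemoryStream[Tuple1[Vector]]
val hashed = model.transform(input.toDS().toDF("features"))
{code}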

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Created] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24465:
-

 Summary: LSHModel should support Structured Streaming for transform
 Key: SPARK-24465
 URL: https://issues.apache.org/jira/browse/SPARK-24465
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.0
 Environment: Locality Sensitive Hashing (LSH) Models 
(BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with 
Structured Streaming (and I believe are the final Transformers which are not 
compatible).  These do not work because Spark SQL does not support nested types 
containing UDTs; see [SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.
Reporter: Joseph K. Bradley









[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Environment: (was: Locality Sensitive Hashing (LSH) Models 
(BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with 
Structured Streaming (and I believe are the final Transformers which are not 
compatible).  These do not work because Spark SQL does not support nested types 
containing UDTs; see [SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.)

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>







[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501049#comment-16501049
 ] 

Apache Spark commented on SPARK-24375:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21494

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.






[jira] [Assigned] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24375:


Assignee: Jiang Xingbo  (was: Apache Spark)

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.






[jira] [Assigned] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24375:


Assignee: Apache Spark  (was: Jiang Xingbo)

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501037#comment-16501037
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

If you have a separate repo, it is much cleaner to tag SparkML releases and test 
that they work with the existing Spark releases, say by tagging them as 2.4.0.1, 
2.4.0.2, etc. for every small change that needs to be made on the R side. If they 
are in the same repo then the tag will apply to all other Spark changes at that 
point, making it harder to separate out just the R changes that went into this tag.

Also, this separate repo does not need to be permanent. If we find that the 
package is stable on CRAN then we can move it back into the main repo. I just 
think for the first few releases on CRAN it'll be much easier if it's not tied to 
Spark releases. 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
> has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set

[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501017#comment-16501017
 ] 

Apache Spark commented on SPARK-15784:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21493

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Weichen Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501020#comment-16501020
 ] 

Weichen Xu commented on SPARK-15784:


[~wm624] Thanks for your enthusiasm, but we need this to be done ASAP, so I 
created a PR.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24300.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21492
[https://github.com/apache/spark/pull/21492]

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> 
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Lu Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. 
> This either causes duplicate rows in cluster mode or nondeterministic behavior 
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) 
> }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]
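
For context, a minimal sketch of the per-partition fix described above (assuming a 
spark-shell {{sc}}; names are illustrative and not the actual LDASuite change):

{code:scala}
// Derive an independent RNG per partition from a base seed, so rows differ across
// partitions while staying reproducible for a fixed seed.
val baseSeed = 10L
val data = sc.parallelize(1 to 10, numSlices = 5).mapPartitionsWithIndex { (pid, iter) =>
  val rng = new java.util.Random(baseSeed + pid)
  iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
}
data.collect().foreach(println)
{code}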



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24464) Unit tests for MLlib's Instrumentation

2018-06-04 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-24464:
-

 Summary: Unit tests for MLlib's Instrumentation
 Key: SPARK-24464
 URL: https://issues.apache.org/jira/browse/SPARK-24464
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Xiangrui Meng


We added Instrumentation to MLlib to log params and metrics during machine 
learning training and inference. However, the code has zero test coverage, 
which usually means bugs and regressions in the future. I created this JIRA to 
discuss how we should test Instrumentation.

cc: [~thunterdb] [~josephkb] [~lu.DB]
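
One possible direction, sketched under the assumption that Spark still routes 
Instrumentation output through log4j 1.x (this is a sketch, not a committed design): 
attach a capturing appender in the test, run a small fit(), and assert on the 
captured messages.

{code:scala}
import org.apache.log4j.{AppenderSkeleton, Logger}
import org.apache.log4j.spi.LoggingEvent
import scala.collection.mutable.ArrayBuffer

// Test-only appender that records rendered log messages in memory.
class CapturingAppender extends AppenderSkeleton {
  val messages = ArrayBuffer.empty[String]
  override def append(event: LoggingEvent): Unit = messages += event.getRenderedMessage
  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}

val appender = new CapturingAppender
Logger.getRootLogger.addAppender(appender)
try {
  // Run a small training job here, e.g. new LogisticRegression().fit(df), then
  // assert that the expected params/metrics show up in the captured output:
  // assert(appender.messages.exists(_.contains("maxIter")))
} finally {
  Logger.getRootLogger.removeAppender(appender)
}
{code}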



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24463) Add catalyst rule to reorder TypedFilters separated by Filters to reduce serde operations

2018-06-04 Thread Lior Regev (JIRA)
Lior Regev created SPARK-24463:
--

 Summary: Add catalyst rule to reorder TypedFilters separated by 
Filters to reduce serde operations
 Key: SPARK-24463
 URL: https://issues.apache.org/jira/browse/SPARK-24463
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Lior Regev
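
The description is still empty; for context, a hypothetical example (spark-shell 
assumed) of the pattern such a rule would target. Each typed filter deserializes 
the internal row into the case class and serializes it back, so when deterministic 
typed and untyped filters are interleaved, reordering them so the typed ones sit 
next to each other saves a deserialize/serialize round trip:

{code:scala}
import spark.implicits._

case class Event(id: Long, score: Double, valid: Boolean)

val ds = Seq(Event(1L, 12.0, true), Event(2L, 3.0, false), Event(3L, 42.0, true)).toDS()

val filtered = ds
  .filter(_.score > 10.0)      // TypedFilter: needs the deserialized Event
  .filter($"valid" === true)   // Filter: works directly on the internal row
  .filter(_.id % 3 == 0)       // TypedFilter: deserializes again
{code}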






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.

2018-06-04 Thread Jiachen Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500949#comment-16500949
 ] 

Jiachen Yang commented on SPARK-19860:
--

May I know why this happens? I came across the same problem.
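
The conflict arises because both DataFrames derive from the same underlying 
relation, so {{fdate}} carries the same attribute ID on both sides and the join 
condition degenerates into a trivially true self-comparison (note the 
{{fdate#42 = fdate#42}} in the plan below). A common workaround, sketched here in 
Scala for the df1/df2 from the report, is to alias the two sides before joining; 
the same pattern works in PySpark with {{df1.alias("a")}} and {{df2.alias("b")}}.

{code:scala}
import org.apache.spark.sql.functions.col

// Alias each side so the join condition references two distinct attribute sets
// instead of the same fdate attribute on both sides.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.fdate") === col("b.fdate"), "inner")
{code}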

> DataFrame join get conflict error if two frames has a same name column.
> ---
>
> Key: SPARK-19860
> URL: https://issues.apache.org/jira/browse/SPARK-19860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: wuchang
>Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
> in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
> Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
> in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
> Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
> in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
> Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
> in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
> Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
> in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
> Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
> in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
> Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
> in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
> Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
> in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
> Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
> in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
> Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
> in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
> Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
> in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
> Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
> in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
> Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
> in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
> Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
> in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
> Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
> in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
> Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
> in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
> Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
> 2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
> true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
> jdf = self._jdf.join(other._jdf, on._jc, how)
>   File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount1#97]
> :  +- Filter (inorout#44 = A)
> : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, 
> fdate#42]
> :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && 
> (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> :   +- SubqueryAlias history_transfer_v
> :  +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
> fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
> bankwaterid#48, waterid#49, waterstate#50, source#51]
> : +- SubqueryAlias history_transfer
> :+- 
> Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
>  parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount2#145]
>+- Filter (inorout#44 = B)
>   +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, 
> fdate#42]
>  +- 

[jira] [Resolved] (SPARK-24290) Instrumentation Improvement: add logNamedValue taking Array types

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24290.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21347
[https://github.com/apache/spark/pull/21347]

> Instrumentation Improvement: add logNamedValue taking Array types
> -
>
> Key: SPARK-24290
> URL: https://issues.apache.org/jira/browse/SPARK-24290
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24290) Instrumentation Improvement: add logNamedValue taking Array types

2018-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24290:
-

Assignee: Lu Wang

> Instrumentation Improvement: add logNamedValue taking Array types
> -
>
> Key: SPARK-24290
> URL: https://issues.apache.org/jira/browse/SPARK-24290
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500909#comment-16500909
 ] 

Apache Spark commented on SPARK-24300:
--

User 'ludatabricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/21492

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> 
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Lu Wang
>Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. 
> This either causes duplicate rows in cluster mode or nondeterministic behavior 
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) 
> }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24300:


Assignee: Lu Wang  (was: Apache Spark)

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> 
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Lu Wang
>Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. 
> This either causes duplicate rows in cluster mode or nondeterministic behavior 
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) 
> }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24300:


Assignee: Apache Spark  (was: Lu Wang)

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> 
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. 
> This either causes duplicate rows in cluster mode or nondeterministic behavior 
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) 
> }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500900#comment-16500900
 ] 

Xiangrui Meng commented on SPARK-24374:
---

[~leftnoteasy] Thanks for your input! 

{quote}
This JIRA is trying to solve the gang scheduling problem of ML applications, 
however, gang scheduling should be handled by underlying resource scheduler 
instead of Spark. Because Spark with non-standalone deployment has no control 
of how to do resource allocation.
{quote}

Agree that the overall resource provision should be handled by the resource 
manager. I think this is true for standalone as well. However, inside Spark, we 
still need fine-grained job scheduling to allocate tasks. For example, a 
skewed join might hold some task slots for quite a long time, and hence the tasks 
from the next stage have to wait to start all together. Ideally, Spark should be able to 
talk to the resource manager for better elasticity.

{quote}
If the proposed API is like to implement gang-scheduling by using 
gather-and-hold pattern, existing Spark API should be good enough – just to 
request resources until it reaches target #containers. Application needs to 
wait in both cases.
{quote}

The Spark API is not good enough for two reasons:

1) The all-reduce pattern could be implemented by a single gather and broadcast, 
but the driver that gathers the messages would become the bottleneck when the 
message is big (~20 million features) or there are too many nodes. This is why 
we started with SPARK-1485 (all-reduce) but ended up at SPARK-2174 
(tree-reduce).

2) If we ask the user program to set a barrier and wait for all tasks to be ready, 
it won't work in the case of failures, because Spark will only retry the failed 
task instead of all of them. This requires significant code changes on the user's 
side to handle the failure scenario.

{quote}
MPI needs launched processes to contact their master so master can launch 
slaves and make them to interconnect to each other (phone-home). Application 
needs to implement logics to talk to different RMs.
{quote}

[~jiangxb1987] made a prototype of this scenario, not on YARN but on standalone 
to help discuss the design. In the barrier stage, users can easily get the node 
info of all tasks, so the MPI setup could be simplified. We will take a look 
at the mpich2-yarn code base. Is there a design doc there, so we can quickly get 
the high-level design choices?

{quote}
One potential benefit I can think about embedding app to Spark is, applications 
could directly read from memory of Spark tasks.
{quote}

I'm preparing a SPIP doc for accelerating the data exchange between Spark and 
3rd-party frameworks. I would treat it as an orthogonal issue here. Fast data 
exchange would help the model inference use case a lot. For training, we might 
just need a standard data interface to simplify data conversions.
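
To make the discussion concrete, a purely hypothetical sketch (spark-shell assumed; 
none of these names are a committed API) of how a barrier stage could expose peer 
addresses so an MPI-style job can bootstrap without a separate phone-home service:

{code:scala}
// Hypothetical API sketch only -- the SPIP has not settled on names or semantics.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // All tasks in the stage are launched together; each can see its peers' addresses,
  // which is enough to set up MPI-style communication.
  val peerAddresses = ctx.getTaskInfos().map(_.address)
  ctx.barrier()   // wait until every task in the stage reaches this point
  // ... launch / join the distributed training worker here ...
  iter
}.count()
{code}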

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21896) Stack Overflow when window function nested inside aggregate function

2018-06-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21896.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21473
[https://github.com/apache/spark/pull/21473]

> Stack Overflow when window function nested inside aggregate function
> 
>
> Key: SPARK-21896
> URL: https://issues.apache.org/jira/browse/SPARK-21896
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Luyao Yang
>Assignee: Anton Okolnychyi
>Priority: Minor
> Fix For: 2.4.0
>
>
> A minimal example: with the following simple test data
> {noformat}
> >>> df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ['a', 'b'])
> >>> df.show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  3|
> |  2|  4|
> +---+---+
> {noformat}
> This works: 
> {noformat}
> >>> w = Window().orderBy('b')
> >>> result = (df.select(F.rank().over(w).alias('rk'))
> ....groupby()
> ....agg(F.max('rk'))
> ...  )
> >>> result.show()
> +---+
> |max(rk)|
> +---+
> |  3|
> +---+
> {noformat}
> But this equivalent gives an error. Note that the error is thrown right when 
> the operation is defined, not when an action is called later:
> {noformat}
> >>> result = (df.groupby()
> ....agg(F.max(F.rank().over(w)))
> ...  )
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", 
> line 2885, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 2, in 
> .agg(F.max(F.rank().over(w)))
>   File "/usr/lib/spark/python/pyspark/sql/group.py", line 91, in agg
> _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o10789.agg.
> : java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:55)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:400)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1688)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1724)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1687)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1825)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1800)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.map

[jira] [Assigned] (SPARK-21896) Stack Overflow when window function nested inside aggregate function

2018-06-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21896:
---

Assignee: Anton Okolnychyi

> Stack Overflow when window function nested inside aggregate function
> 
>
> Key: SPARK-21896
> URL: https://issues.apache.org/jira/browse/SPARK-21896
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Luyao Yang
>Assignee: Anton Okolnychyi
>Priority: Minor
> Fix For: 2.4.0
>
>
> A minimal example: with the following simple test data
> {noformat}
> >>> df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ['a', 'b'])
> >>> df.show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  3|
> |  2|  4|
> +---+---+
> {noformat}
> This works: 
> {noformat}
> >>> w = Window().orderBy('b')
> >>> result = (df.select(F.rank().over(w).alias('rk'))
> ....groupby()
> ....agg(F.max('rk'))
> ...  )
> >>> result.show()
> +---+
> |max(rk)|
> +---+
> |  3|
> +---+
> {noformat}
> But this equivalent gives an error. Note that the error is thrown right when 
> the operation is defined, not when an action is called later:
> {noformat}
> >>> result = (df.groupby()
> ....agg(F.max(F.rank().over(w)))
> ...  )
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", 
> line 2885, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 2, in 
> .agg(F.max(F.rank().over(w)))
>   File "/usr/lib/spark/python/pyspark/sql/group.py", line 91, in agg
> _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o10789.agg.
> : java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:55)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:400)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1688)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1724)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1687)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1825)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1800)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.tr

[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500809#comment-16500809
 ] 

Xiangrui Meng commented on SPARK-15784:
---

Discussed with [~WeichenXu123] offline. I think we should change the APIs to 
the following:

{code}
class PowerIterationClustering extends Params with HasWeightCol with 
DefaultReadWrite {
  def srcCol: Param[String]
  def dstCol: Param[String]
  def weightCol: Param[String]
  def assignClusters(dataset: Dataset[_]): DataFrame[id: Long, cluster: Int]
}
{code}
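
Hypothetical usage under that sketch (spark-shell assumed; the column names and the 
{{setK}}/{{setMaxIter}} params are assumptions by analogy with the spark.mllib 
implementation, not part of the API above):

{code:scala}
// Build a small weighted similarity graph: one (src, dst, weight) row per edge.
val edges = spark.createDataFrame(Seq(
  (0L, 1L, 1.0), (1L, 2L, 1.0), (2L, 3L, 0.1)
)).toDF("src", "dst", "weight")

val pic = new PowerIterationClustering()
  .setK(2)              // assumed param
  .setMaxIter(10)       // assumed param
  .setWeightCol("weight")

// Transformer-less API: directly returns one (id, cluster) row per vertex.
val assignments = pic.assignClusters(edges)
assignments.show()
{code}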

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster

2018-06-04 Thread Pritpal Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500794#comment-16500794
 ] 

Pritpal Singh commented on SPARK-24448:
---

[~jerryshao], client mode does not work either. We use a standalone cluster. Is 
this an issue with standalone?

> File not found on the address SparkFiles.get returns on standalone cluster
> --
>
> Key: SPARK-24448
> URL: https://issues.apache.org/jira/browse/SPARK-24448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Pritpal Singh
>Priority: Major
>
> I want to upload a file onto all worker nodes in a standalone cluster and 
> retrieve the location of the file. Here is my code:
>  
> val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks"
> val file = new File(tempKeyStoreLoc)
> sparkContext.addFile(file.getAbsolutePath)
> val keyLoc = SparkFiles.get("keystore.jks")
>  
> SparkFiles.get returns a random location where keystore.jks does not exist. I 
> submit the job in cluster mode. In fact, the location SparkFiles.get returns does 
> not exist on any of the worker nodes (including the driver node). 
> I observed that Spark does load keystore.jks files on worker nodes at 
> /work///keystore.jks. The partition_id 
> changes from one worker node to another.
> My requirement is to upload a file on all nodes of a cluster and retrieve its 
> location. I'm expecting the location to be common across all worker nodes.
>  
>  
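
In general, {{SparkFiles.get}} resolves against a process-local root directory, so 
the returned path is only meaningful inside the JVM that calls it and is not 
expected to be identical across nodes; the usual pattern is to resolve it inside a 
task on the executor that will read the file. A sketch (assuming a spark-shell 
{{sc}} and that the keystore sits at the path shown):

{code:scala}
sc.addFile("/tmp/keystore.jks")

// Resolve the path on each executor, inside a task, rather than once on the driver.
val perTask = sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism).map { _ =>
  val local = org.apache.spark.SparkFiles.get("keystore.jks")
  (local, new java.io.File(local).exists())
}.collect()

perTask.foreach(println)
{code}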



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24453) Fix error recovering from the failure in a no-data batch

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24453:


Assignee: Apache Spark  (was: Tathagata Das)

> Fix error recovering from the failure in a no-data batch
> 
>
> Key: SPARK-24453
> URL: https://issues.apache.org/jira/browse/SPARK-24453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> java.lang.AssertionError: assertion failed: Concurrent update to the log. 
> Multiple streaming jobs detected for 159897
> {noformat}
> The error occurs when we are recovering from a failure in a no-data batch 
> (say X) that has been planned (i.e. written to the offset log) but not executed 
> (i.e. not written to the commit log). Upon recovery, the following sequence of 
> events happens:
> - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. 
> Since there was no data in the batch, `availableOffsets` is the same as 
> `committedOffsets`, so `isNewDataAvailable` is false.
> - When `MicroBatchExecution.constructNextBatch` is called, ideally it should 
> immediately return true because the next batch has already been constructed. 
> However, the check of whether the batch has been constructed was `if 
> (isNewDataAvailable) return true`. Since the planned batch is a no-data 
> batch, it escaped this check and proceeded to plan the same batch X once 
> again. And if there is new data since the failure, it plans a new batch, 
> tries to write new offsets to the `offsetLog` as batchId X, and fails with 
> the above error.
> The correct solution is to check the offset log to see whether currentBatchId 
> is the latest or not.
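
A minimal sketch of the intended check (illustrative, not the exact patch): given 
the latest batch id present in the offset log, the current batch counts as already 
constructed when the log has an entry for it, regardless of whether new data has 
arrived since the failure.

{code:scala}
// latestPlannedBatchId: the newest batch id found in the offset log (None if empty).
def isBatchAlreadyConstructed(latestPlannedBatchId: Option[Long], currentBatchId: Long): Boolean =
  latestPlannedBatchId.exists(_ >= currentBatchId)

// Recovering batch X = 159897 that was planned (offset log) but not committed:
assert(isBatchAlreadyConstructed(Some(159897L), 159897L))   // re-execute, do not re-plan
assert(!isBatchAlreadyConstructed(Some(159896L), 159897L))  // genuinely new batch: plan it
{code}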



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24453) Fix error recovering from the failure in a no-data batch

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24453:


Assignee: Tathagata Das  (was: Apache Spark)

> Fix error recovering from the failure in a no-data batch
> 
>
> Key: SPARK-24453
> URL: https://issues.apache.org/jira/browse/SPARK-24453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> {noformat}
> java.lang.AssertionError: assertion failed: Concurrent update to the log. 
> Multiple streaming jobs detected for 159897
> {noformat}
> The error occurs when we are recovering from a failure in a no-data batch 
> (say X) that has been planned (i.e. written to the offset log) but not executed 
> (i.e. not written to the commit log). Upon recovery, the following sequence of 
> events happens:
> - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. 
> Since there was no data in the batch, `availableOffsets` is the same as 
> `committedOffsets`, so `isNewDataAvailable` is false.
> - When `MicroBatchExecution.constructNextBatch` is called, ideally it should 
> immediately return true because the next batch has already been constructed. 
> However, the check of whether the batch has been constructed was `if 
> (isNewDataAvailable) return true`. Since the planned batch is a no-data 
> batch, it escaped this check and proceeded to plan the same batch X once 
> again. And if there is new data since the failure, it plans a new batch, 
> tries to write new offsets to the `offsetLog` as batchId X, and fails with 
> the above error.
> The correct solution is to check the offset log to see whether currentBatchId 
> is the latest or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24453) Fix error recovering from the failure in a no-data batch

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500770#comment-16500770
 ] 

Apache Spark commented on SPARK-24453:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21491

> Fix error recovering from the failure in a no-data batch
> 
>
> Key: SPARK-24453
> URL: https://issues.apache.org/jira/browse/SPARK-24453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> {noformat}
> java.lang.AssertionError: assertion failed: Concurrent update to the log. 
> Multiple streaming jobs detected for 159897
> {noformat}
> The error occurs when we are recovering from a failure in a no-data batch 
> (say X) that has been planned (i.e. written to the offset log) but not executed 
> (i.e. not written to the commit log). Upon recovery, the following sequence of 
> events happens:
> - `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. 
> Since there was no data in the batch, `availableOffsets` is the same as 
> `committedOffsets`, so `isNewDataAvailable` is false.
> - When `MicroBatchExecution.constructNextBatch` is called, ideally it should 
> immediately return true because the next batch has already been constructed. 
> However, the check of whether the batch has been constructed was `if 
> (isNewDataAvailable) return true`. Since the planned batch is a no-data 
> batch, it escaped this check and proceeded to plan the same batch X once 
> again. And if there is new data since the failure, it plans a new batch, 
> tries to write new offsets to the `offsetLog` as batchId X, and fails with 
> the above error.
> The correct solution is to check the offset log to see whether currentBatchId 
> is the latest or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500639#comment-16500639
 ] 

Reynold Xin commented on SPARK-24359:
-

Why would a separate repo lead to faster iteration? What's the difference 
between that and just a directory in the mainline repo that's not part of the same 
build as the mainline repo?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, like in the above example, the syntax can nicely 
> translate to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500634#comment-16500634
 ] 

Felix Cheung commented on SPARK-24359:
--

+1 on the `spark-website` model for faster iterations. This was my suggestion 
originally, not just for releases but also for publishing on CRAN. But if you can 
get the package source into a state where it "works" without Spark (JVM) and 
SparkR, that will make the publication process easier.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, like in the above example, the syntax can nicely 
> translate to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
>

[jira] [Assigned] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24462:


Assignee: Apache Spark

> Text socket micro-batch reader throws error when a query is restarted with 
> saved state
> --
>
> Key: SPARK-24462
> URL: https://issues.apache.org/jira/browse/SPARK-24462
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Critical
>
> Exception thrown:
>  
> {noformat}
> scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 
> 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = 
> f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error
> java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377)
>  
> {noformat}
>  
> Sample code that reproduces the error on restarting the query.
>  
> {code:java}
>  
> import java.sql.Timestamp
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import spark.implicits._
> import org.apache.spark.sql.streaming.Trigger
> val lines = spark.readStream.format("socket").option("host", 
> "localhost").option("port", ).option("includeTimestamp", true).load()
> val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" 
> ").map(word => (word, line._2))).toDF("word", "timestamp")
> val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 
> minutes"), $"word").count().orderBy("window")
> val query = 
> windowedCounts.writeStream.outputMode("complete").option("checkpointLocation",
>  "/tmp/debug").format("console").option("truncate", "false").start()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state

2018-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24462:


Assignee: (was: Apache Spark)

> Text socket micro-batch reader throws error when a query is restarted with 
> saved state
> --
>
> Key: SPARK-24462
> URL: https://issues.apache.org/jira/browse/SPARK-24462
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Arun Mahadevan
>Priority: Critical
>
> Exception thrown:
>  
> {noformat}
> scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 
> 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = 
> f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error
> java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377)
>  
> {noformat}
>  
> Sample code that reproduces the error on restarting the query.
>  
> {code:java}
>  
> import java.sql.Timestamp
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import spark.implicits._
> import org.apache.spark.sql.streaming.Trigger
> val lines = spark.readStream.format("socket").option("host", 
> "localhost").option("port", ).option("includeTimestamp", true).load()
> val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" 
> ").map(word => (word, line._2))).toDF("word", "timestamp")
> val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 
> minutes"), $"word").count().orderBy("window")
> val query = 
> windowedCounts.writeStream.outputMode("complete").option("checkpointLocation",
>  "/tmp/debug").format("console").option("truncate", "false").start()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500606#comment-16500606
 ] 

Apache Spark commented on SPARK-24462:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21490

> Text socket micro-batch reader throws error when a query is restarted with 
> saved state
> --
>
> Key: SPARK-24462
> URL: https://issues.apache.org/jira/browse/SPARK-24462
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Arun Mahadevan
>Priority: Critical
>
> Exception thrown:
>  
> {noformat}
> scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 
> 0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = 
> f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error
> java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377)
>  
> {noformat}
>  
> Sample code that reproduces the error on restarting the query.
>  
> {code:java}
>  
> import java.sql.Timestamp
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import spark.implicits._
> import org.apache.spark.sql.streaming.Trigger
> val lines = spark.readStream.format("socket").option("host", 
> "localhost").option("port", ).option("includeTimestamp", true).load()
> val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" 
> ").map(word => (word, line._2))).toDF("word", "timestamp")
> val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 
> minutes"), $"word").count().orderBy("window")
> val query = 
> windowedCounts.writeStream.outputMode("complete").option("checkpointLocation",
>  "/tmp/debug").format("console").option("truncate", "false").start()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24462) Text socket micro-batch reader throws error when a query is restarted with saved state

2018-06-04 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-24462:
--

 Summary: Text socket micro-batch reader throws error when a query 
is restarted with saved state
 Key: SPARK-24462
 URL: https://issues.apache.org/jira/browse/SPARK-24462
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Arun Mahadevan


Exception thrown:

 
{noformat}
scala> 18/06/01 22:47:04 ERROR MicroBatchExecution: Query [id = 
0bdc4428-5d21-4237-9d64-898ae65f28f3, runId = 
f6822423-2bd2-47c1-8ed6-799d1c481195] terminated with error
java.lang.RuntimeException: Offsets committed out of order: 2 followed by -1
 at scala.sys.package$.error(package.scala:27)
 at 
org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:197)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$2$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:377)
 
{noformat}
 

Sample code that reproduces the error on restarting the query.

 
{code:java}
 

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.streaming.Trigger

val lines = spark.readStream.format("socket").option("host", 
"localhost").option("port", ).option("includeTimestamp", true).load()

val words = lines.as[(String, Timestamp)].flatMap(line => line._1.split(" 
").map(word => (word, line._2))).toDF("word", "timestamp")

val windowedCounts = words.groupBy(window($"timestamp", "20 minutes", "20 
minutes"), $"word").count().orderBy("window")

val query = 
windowedCounts.writeStream.outputMode("complete").option("checkpointLocation", 
"/tmp/debug").format("console").option("truncate", "false").start()


{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-04 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500560#comment-16500560
 ] 

Steve Loughran commented on SPARK-20202:


I think you could split this into two pieces of work:

# a modified Hive 1.2.1.x for Hadoop 3, with a new package name in the Maven 
builds (joy, profiles!) and some work with the Hive team to get it officially 
published by them. Strength: easy for people to backport into shipping 2.2 and 
2.3 builds just by changing the POM.

# the bigger move to Hive 2. This will be the best option for the future, but it 
is bound to have more surprises. The Hive team might even need to make some 
changes on their side, which is feasible if the timelines line up.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23903) Add support for date extract

2018-06-04 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23903.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21479
[https://github.com/apache/spark/pull/21479]
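
For reference, a minimal usage sketch of the syntax this adds, assuming a build 
that includes this patch (2.4.0+); YEAR and MONTH are shown as examples, and the 
full set of supported fields is whatever the PR implements:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# EXTRACT(field FROM source) in Spark SQL, per this change:
spark.sql(
    "SELECT EXTRACT(YEAR FROM DATE '2018-06-04') AS y, "
    "       EXTRACT(MONTH FROM DATE '2018-06-04') AS m"
).show()
# Expected: y = 2018, m = 6
{code}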

> Add support for date extract
> 
>
> Key: SPARK-23903
> URL: https://issues.apache.org/jira/browse/SPARK-23903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Heavily used in time-series datasets. 
> https://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
> {noformat}
> EXTRACT(field FROM source)
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23903) Add support for date extract

2018-06-04 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23903:
-

Assignee: Yuming Wang

> Add support for date extract
> 
>
> Key: SPARK-23903
> URL: https://issues.apache.org/jira/browse/SPARK-23903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> Heavily used in time-series datasets. 
> https://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
> {noformat}
> EXTRACT(field FROM source)
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24461) Snapshot Cache

2018-06-04 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500539#comment-16500539
 ] 

Xiao Li commented on SPARK-24461:
-

cc [~maryannxue] 

> Snapshot Cache
> --
>
> Key: SPARK-24461
> URL: https://issues.apache.org/jira/browse/SPARK-24461
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some usage scenarios, data staleness is not critical. We can introduce a 
> snapshot cache of the original data to achieve much better performance. 
> Unlike the current cache, it is resolved by name instead of by plan 
> matching. Cache rebuilds can be triggered manually or by events. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24461) Snapshot Cache

2018-06-04 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24461:
---

 Summary: Snapshot Cache
 Key: SPARK-24461
 URL: https://issues.apache.org/jira/browse/SPARK-24461
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


In some usage scenarios, data staleness is not critical. We can introduce a 
snapshot cache of the original data to achieve much better performance. 
Unlike the current cache, it is resolved by name instead of by plan 
matching. Cache rebuilds can be triggered manually or by events. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24460) exactly-once mode

2018-06-04 Thread Jose Torres (JIRA)
Jose Torres created SPARK-24460:
---

 Summary: exactly-once mode
 Key: SPARK-24460
 URL: https://issues.apache.org/jira/browse/SPARK-24460
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


Major changes we know will need to be made:
 * Restart strategy needs to replay offsets already in the log as microbatches.
 * Some kind of epoch alignment - need to think about this further.

We'll also need a test plan to ensure that we've actually achieved exactly-once semantics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24459) watermarks

2018-06-04 Thread Jose Torres (JIRA)
Jose Torres created SPARK-24459:
---

 Summary: watermarks
 Key: SPARK-24459
 URL: https://issues.apache.org/jira/browse/SPARK-24459
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24210) incorrect handling of boolean expressions when using column in expressions in pyspark.sql.DataFrame filter function

2018-06-04 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500383#comment-16500383
 ] 

Li Yuanjian edited comment on SPARK-24210 at 6/4/18 3:34 PM:
-

I think this may not be a bug.
{code:python}
# KO: returns r1 and r3
ex.filter(('c1 = 1') and ('c2 = 1')).show()
{code}
This is caused by Python's own 'and' operator evaluating the plain strings: by the 
time the argument reaches df.filter, only 'c2 = 1' is left.
{code:python}
# KO: returns r0 and r3
ex.filter('c1 = 1 & c2 = 1').show()
# KO: returns r0 and r3
ex.filter('c1 == 1 & c2 == 1').show()
{code}
As you mentioned, [https://github.com/apache/spark/pull/6961] actually fixed 
'&' between Columns, but not string expressions like 'c1 = 1 & c2 = 1'. Here, in 
ex.filter('c1 = 1 & c2 = 1'), Spark parses the whole string into a value 
expression like 'Filter (('a = (1 & 'b)) = 1), so I think the result makes sense. 
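
A minimal, Spark-free illustration of the Python side of this, for anyone hitting 
the same surprise (plain CPython, no extra assumptions):

{code:python}
# A non-empty str is truthy, and `and` returns its second operand,
# so the filter string Spark receives is just 'c2 = 1'.
expr = ('c1 = 1') and ('c2 = 1')
print(expr)            # -> c2 = 1
print(bool('c1 = 1'))  # -> True

# i.e. ex.filter(('c1 = 1') and ('c2 = 1')) is effectively ex.filter('c2 = 1'),
# which is why r1 and r3 come back.
{code}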


was (Author: xuanyuan):
I think it maybe not a bug.
#KO: returns r1 and r3
ex.filter(('c1 = 1') and ('c2 = 1')).show()
This cause by python self base string __and__ implementation. After passing to 
df.filter, there's only 'c2 = 1'.
#KO: returns r0 and r3
ex.filter('c1 = 1 & c2 = 1').show()
#KO: returns r0 and r3
ex.filter('c1 == 1 & c2 == 1').show()
As you mentioned, [https://github.com/apache/spark/pull/6961] actually fix the 
'&' between column, but not string expression like 'c1 = 1 & c2 = 1', here in 
ex.filter('c1 = 1 & c2 = 1'), Spark parse it to valueExpression like: 'Filter 
(('a = (1 & 'b)) = 1), I think this make sense here. 

> incorrect handling of boolean expressions when using column in expressions in 
> pyspark.sql.DataFrame filter function
> ---
>
> Key: SPARK-24210
> URL: https://issues.apache.org/jira/browse/SPARK-24210
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.2
>Reporter: Michael H
>Priority: Major
>
> {code:python}
> ex = spark.createDataFrame([
> ('r0', 0, 0),
> ('r1', 0, 1),
> ('r2', 1, 0),
> ('r3', 1, 1)]\
>   , "row: string, c1: int, c2: int")
> #KO: returns r1 and r3
> ex.filter(('c1 = 1') and ('c2 = 1')).show()
> #OK, raises an exception
> ex.filter(('c1 == 1') & ('c2 == 1')).show()
> #KO: returns r0 and r3
> ex.filter('c1 = 1 & c2 = 1').show()
> #KO: returns r0 and r3
> ex.filter('c1 == 1 & c2 == 1').show()
> #OK: returns r3 only
> ex.filter('c1 = 1 and c2 = 1').show()
> #OK: returns r3 only
> ex.filter('c1 == 1 and c2 == 1').show()
> {code}
> building the expressions using {code}ex.c1{code} or {code}ex['c1']{code} we 
> don't have this.
> Issue seems related with
> https://github.com/apache/spark/pull/6961



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24210) incorrect handling of boolean expressions when using column in expressions in pyspark.sql.DataFrame filter function

2018-06-04 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500383#comment-16500383
 ] 

Li Yuanjian commented on SPARK-24210:
-

I think this may not be a bug.
# KO: returns r1 and r3
ex.filter(('c1 = 1') and ('c2 = 1')).show()
This is caused by Python's own 'and' operator evaluating the plain strings: by the 
time the argument reaches df.filter, only 'c2 = 1' is left.
# KO: returns r0 and r3
ex.filter('c1 = 1 & c2 = 1').show()
# KO: returns r0 and r3
ex.filter('c1 == 1 & c2 == 1').show()
As you mentioned, [https://github.com/apache/spark/pull/6961] actually fixed 
'&' between Columns, but not string expressions like 'c1 = 1 & c2 = 1'. Here, in 
ex.filter('c1 = 1 & c2 = 1'), Spark parses the whole string into a value 
expression like 'Filter (('a = (1 & 'b)) = 1), so I think the result makes sense. 

> incorrect handling of boolean expressions when using column in expressions in 
> pyspark.sql.DataFrame filter function
> ---
>
> Key: SPARK-24210
> URL: https://issues.apache.org/jira/browse/SPARK-24210
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.2
>Reporter: Michael H
>Priority: Major
>
> {code:python}
> ex = spark.createDataFrame([
> ('r0', 0, 0),
> ('r1', 0, 1),
> ('r2', 1, 0),
> ('r3', 1, 1)]\
>   , "row: string, c1: int, c2: int")
> #KO: returns r1 and r3
> ex.filter(('c1 = 1') and ('c2 = 1')).show()
> #OK, raises an exception
> ex.filter(('c1 == 1') & ('c2 == 1')).show()
> #KO: returns r0 and r3
> ex.filter('c1 = 1 & c2 = 1').show()
> #KO: returns r0 and r3
> ex.filter('c1 == 1 & c2 == 1').show()
> #OK: returns r3 only
> ex.filter('c1 = 1 and c2 = 1').show()
> #OK: returns r3 only
> ex.filter('c1 == 1 and c2 == 1').show()
> {code}
> building the expressions using {code}ex.c1{code} or {code}ex['c1']{code} we 
> don't have this.
> Issue seems related with
> https://github.com/apache/spark/pull/6961



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2018-06-04 Thread Angel Conde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500278#comment-16500278
 ] 

Angel Conde  commented on SPARK-10409:
--

Another important use case for the MLP regressor is forecasting machine 
sensor data. One of our clients uses this approach for predictive maintenance 
on Industry 4.0 assets. We were hoping to replace their custom implementation, 
which uses an ad-hoc library, with the Spark ML implementation, but we are 
blocked until this gets merged.

Can't go into too much detail about the use case, but it's in production in 
industrial environments. The general approach is to predict one sensor value 
from the others.

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. 
> The implementation might take advantage of autoencoder. Time-series 
> forecasting for financial data might be one of the use cases, see 
> http://dl.acm.org/citation.cfm?id=561452. So there is the need for more 
> specific requirements from this (or other) area.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child

2018-06-04 Thread Abdeali Kothari (JIRA)
Abdeali Kothari created SPARK-24458:
---

 Summary: Invalid PythonUDF check_1(), requires attributes from 
more than one child
 Key: SPARK-24458
 URL: https://issues.apache.org/jira/browse/SPARK-24458
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: Spark 2.3.0 (local mode)

Mac OSX
Reporter: Abdeali Kothari


I was trying out a very large query execution plan I have and I got the error:

 
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o359.simpleString.
: java.lang.RuntimeException: Invalid PythonUDF check_1(), requires attributes 
from more than one child.
 at scala.sys.package$.error(package.scala:27)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181)
 at scala.collection.immutable.Stream.foreach(Stream.scala:594)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
 at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
 at scala.collection.immutable.List.foldLeft(List.scala:84)
 at 
org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
 at 
org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
 at 
org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:214)
 at java.lang.Thread.run(Thread.java:748){code}
I get a dataframe (df) after a lot of P
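
The query above is truncated, so the following is only a hypothetical sketch of 
the situation the error message describes, a Python UDF whose inputs come from 
both children of a join; the names (check_1, the columns) are illustrative and 
not taken from the reporter's code:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.master("local[2]").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], "id: int, x: string")
right = spark.createDataFrame([(10, "a"), (20, "c")], "key: int, y: string")

# Stand-in for check_1(): a Python UDF used directly in the join condition,
# so its inputs span both children of the join node.
check_1 = udf(lambda x, y: x == y, BooleanType())

joined = left.join(right, check_1(left["x"], right["y"]))
joined.explain()  # physical planning (ExtractPythonUDFs) is where the check runs
{code}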

[jira] [Created] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils

2018-06-04 Thread Sharad Sonker (JIRA)
Sharad Sonker created SPARK-24457:
-

 Summary: Performance improvement while converting 
stringToTimestamp in DateTimeUtils
 Key: SPARK-24457
 URL: https://issues.apache.org/jira/browse/SPARK-24457
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Sharad Sonker


stringToTimestamp in DateTimeUtils creates a new Calendar instance for every input 
row, even when the input timezone is the same. This can be improved by caching a 
Calendar instance per input timezone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org