[jira] [Commented] (SPARK-20052) Some InputDStream needs closing processing after processing all batches when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935789#comment-15935789
 ] 

Sasaki Toru commented on SPARK-20052:
-

Sorry, my explanation was not clear.

This ticket is related to SPARK-20050.
In JobGenerator#stop, when graceful shutdown is enabled, it waits for all batches 
to finish after InputDStream#stop has been called, but the Kafka 0.10 DirectStream 
should commit its offsets only after all batches have been processed.

So I thought an additional step (what I called a "closing process") is needed after 
all batches have been processed.
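
For context, this is the usual offset-commit pattern with the Kafka 0.10 direct 
stream (a minimal sketch; ssc, topics and kafkaParams are placeholders, not from 
this ticket). commitAsync only queues the offsets and sends them on a later batch, 
which is why a closing step after the final batch matters:

{code}
import org.apache.spark.streaming.kafka010._

// ssc, topics and kafkaParams are assumed to be defined elsewhere.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // The commit is queued here and actually sent on a subsequent batch, so the
  // offsets of the last batches can still be pending at shutdown.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}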


> Some InputDStream needs closing processing after processing all batches when 
> graceful shutdown
> --
>
> Key: SPARK-20052
> URL: https://issues.apache.org/jira/browse/SPARK-20052
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> Some classes that extend InputDStream need a closing step after all batches have 
> been processed when graceful shutdown is enabled.
> (e.g. when using Kafka as the data source, processed offsets need to be committed 
> to the Kafka broker)
> InputDStream has a 'stop' method to stop receiving data, but this method is 
> called before the last batches generated for the graceful shutdown are processed.






[jira] [Updated] (SPARK-19925) SparkR spark.getSparkFiles fails on executor

2017-03-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19925:

Fix Version/s: 2.1.1

> SparkR spark.getSparkFiles fails on executor
> 
>
> Key: SPARK-19925
> URL: https://issues.apache.org/jira/browse/SPARK-19925
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
> Attachments: error-log
>
>
> SparkR function {{spark.getSparkFiles}} fails when it is called on 
> executors. For example, the following R code will fail. (See error logs in 
> the attachment.) 
> {code}
> spark.addFile("./README.md")
> seq <- seq(from = 1, to = 10, length.out = 5)
> train <- function(seq) {
> path <- spark.getSparkFiles("README.md")
> print(path)
> }
> spark.lapply(seq, train)
> {code}
> However, we can run successfully with Scala API:
> {code}
> import org.apache.spark.SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first()
> {code}
> and also successfully with Python API:
> {code}
> from pyspark import SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first()
> {code}






[jira] [Resolved] (SPARK-19925) SparkR spark.getSparkFiles fails on executor

2017-03-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19925.
-
Resolution: Fixed

> SparkR spark.getSparkFiles fails on executor
> 
>
> Key: SPARK-19925
> URL: https://issues.apache.org/jira/browse/SPARK-19925
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
> Attachments: error-log
>
>
> SparkR function {{spark.getSparkFiles}} fails when it is called on 
> executors. For example, the following R code will fail. (See error logs in 
> the attachment.) 
> {code}
> spark.addFile("./README.md")
> seq <- seq(from = 1, to = 10, length.out = 5)
> train <- function(seq) {
> path <- spark.getSparkFiles("README.md")
> print(path)
> }
> spark.lapply(seq, train)
> {code}
> However, we can run successfully with Scala API:
> {code}
> import org.apache.spark.SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first()
> {code}
> and also successfully with Python API:
> {code}
> from pyspark import SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first()
> {code}






[jira] [Updated] (SPARK-19925) SparkR spark.getSparkFiles fails on executor

2017-03-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19925:

Target Version/s: 2.2.0
   Fix Version/s: 2.2.0

> SparkR spark.getSparkFiles fails on executor
> 
>
> Key: SPARK-19925
> URL: https://issues.apache.org/jira/browse/SPARK-19925
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: error-log
>
>
> SparkR function {{spark.getSparkFiles}} fails when it is called on 
> executors. For example, the following R code will fail. (See error logs in 
> the attachment.) 
> {code}
> spark.addFile("./README.md")
> seq <- seq(from = 1, to = 10, length.out = 5)
> train <- function(seq) {
> path <- spark.getSparkFiles("README.md")
> print(path)
> }
> spark.lapply(seq, train)
> {code}
> However, we can run successfully with Scala API:
> {code}
> import org.apache.spark.SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first()
> {code}
> and also successfully with Python API:
> {code}
> from pyspark import SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first()
> {code}






[jira] [Assigned] (SPARK-19925) SparkR spark.getSparkFiles fails on executor

2017-03-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19925:
---

Assignee: Yanbo Liang

> SparkR spark.getSparkFiles fails on executor
> 
>
> Key: SPARK-19925
> URL: https://issues.apache.org/jira/browse/SPARK-19925
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: error-log
>
>
> SparkR function {{spark.getSparkFiles}} fails when it is called on 
> executors. For example, the following R code will fail. (See error logs in 
> the attachment.) 
> {code}
> spark.addFile("./README.md")
> seq <- seq(from = 1, to = 10, length.out = 5)
> train <- function(seq) {
> path <- spark.getSparkFiles("README.md")
> print(path)
> }
> spark.lapply(seq, train)
> {code}
> However, we can run successfully with Scala API:
> {code}
> import org.apache.spark.SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first()
> {code}
> and also successfully with Python API:
> {code}
> from pyspark import SparkFiles
> sc.addFile("./README.md")
> sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first()
> {code}






[jira] [Updated] (SPARK-20052) Some InputDStream needs closing processing after processing all batches when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-20052:

Summary: Some InputDStream needs closing processing after processing all 
batches when graceful shutdown  (was: Some InputDStream needs closing 
processing after all batches processed when graceful shutdown)

> Some InputDStream needs closing processing after processing all batches when 
> graceful shutdown
> --
>
> Key: SPARK-20052
> URL: https://issues.apache.org/jira/browse/SPARK-20052
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> Some classes that extend InputDStream need a closing step after all batches have 
> been processed when graceful shutdown is enabled.
> (e.g. when using Kafka as the data source, processed offsets need to be committed 
> to the Kafka broker)
> InputDStream has a 'stop' method to stop receiving data, but this method is 
> called before the last batches generated for the graceful shutdown are processed.






[jira] [Commented] (SPARK-20052) Some InputDStream needs closing processing after all batches processed when graceful shutdown

2017-03-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935768#comment-15935768
 ] 

Sean Owen commented on SPARK-20052:
---

What do you have in mind? I don't think stopping the stream makes all batches 
finish immediately.

> Some InputDStream needs closing processing after all batches processed when 
> graceful shutdown
> -
>
> Key: SPARK-20052
> URL: https://issues.apache.org/jira/browse/SPARK-20052
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> Some classes that extend InputDStream need a closing step after all batches have 
> been processed when graceful shutdown is enabled.
> (e.g. when using Kafka as the data source, processed offsets need to be committed 
> to the Kafka broker)
> InputDStream has a 'stop' method to stop receiving data, but this method is 
> called before the last batches generated for the graceful shutdown are processed.






[jira] [Resolved] (SPARK-20030) Add Event Time based Timeout

2017-03-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20030.
---
Resolution: Fixed

Issue resolved by pull request 17361
[https://github.com/apache/spark/pull/17361]

> Add Event Time based Timeout
> 
>
> Key: SPARK-20030
> URL: https://issues.apache.org/jira/browse/SPARK-20030
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>







[jira] [Updated] (SPARK-13947) The error message from using an invalid table reference is not clear

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13947:

Priority: Minor  (was: Major)

> The error message from using an invalid table reference is not clear
> 
>
> Key: SPARK-13947
> URL: https://issues.apache.org/jira/browse/SPARK-13947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wes McKinney
>Priority: Minor
>
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(1000),
>'bar': np.random.randn(1000)})
> df2 = pd.DataFrame({'foo': np.random.randn(1000),
> 'bar': np.random.randn(1000)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sqlContext.createDataFrame(df2)
> sdf[sdf2.foo > 0]
> {code}
> Produces this error message:
> {code}
> AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 
> in operator !Filter (foo#91 > cast(0 as double));'
> {code}
> It may be possible to make it more clear what the user did wrong. 
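
To make the failure mode concrete, here is a hedged Scala sketch of the same misuse 
(spark is an existing SparkSession, e.g. in spark-shell; the two DataFrames stand in 
for sdf and sdf2 above):

{code}
import spark.implicits._

// Two DataFrames with the same columns but separate lineages.
val df  = Seq((1.0, 2.0), (-1.0, 3.0)).toDF("foo", "bar")
val df2 = Seq((0.5, 1.0), (-0.5, 4.0)).toDF("foo", "bar")

df.filter(df("foo") > 0).show()    // fine: the column belongs to df's own plan
df.filter(df2("foo") > 0).show()   // AnalysisException: resolved attribute(s) ... missing
{code}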






[jira] [Updated] (SPARK-13947) PySpark DataFrames: The error message from using an invalid table reference is not clear

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13947:

Component/s: (was: PySpark)
 SQL

> PySpark DataFrames: The error message from using an invalid table reference 
> is not clear
> 
>
> Key: SPARK-13947
> URL: https://issues.apache.org/jira/browse/SPARK-13947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wes McKinney
>
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(1000),
>'bar': np.random.randn(1000)})
> df2 = pd.DataFrame({'foo': np.random.randn(1000),
> 'bar': np.random.randn(1000)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sqlContext.createDataFrame(df2)
> sdf[sdf2.foo > 0]
> {code}
> Produces this error message:
> {code}
> AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 
> in operator !Filter (foo#91 > cast(0 as double));'
> {code}
> It may be possible to make it more clear what the user did wrong. 






[jira] [Updated] (SPARK-13947) The error message from using an invalid table reference is not clear

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13947:

Summary: The error message from using an invalid table reference is not 
clear  (was: PySpark DataFrames: The error message from using an invalid table 
reference is not clear)

> The error message from using an invalid table reference is not clear
> 
>
> Key: SPARK-13947
> URL: https://issues.apache.org/jira/browse/SPARK-13947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wes McKinney
>
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(1000),
>'bar': np.random.randn(1000)})
> df2 = pd.DataFrame({'foo': np.random.randn(1000),
> 'bar': np.random.randn(1000)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sqlContext.createDataFrame(df2)
> sdf[sdf2.foo > 0]
> {code}
> Produces this error message:
> {code}
> AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 
> in operator !Filter (foo#91 > cast(0 as double));'
> {code}
> It may be possible to make it more clear what the user did wrong. 






[jira] [Commented] (SPARK-20035) Spark 2.0.2 writes empty file if no record is in the dataset

2017-03-21 Thread Ryan Magnusson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935693#comment-15935693
 ] 

Ryan Magnusson commented on SPARK-20035:


I'd like to start looking into this if no one else is already.

> Spark 2.0.2 writes empty file if no record is in the dataset
> 
>
> Key: SPARK-20035
> URL: https://issues.apache.org/jira/browse/SPARK-20035
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Linux/Windows
>Reporter: Andrew
>
> When there is no record in a dataset, calling write with the spark-csv format 
> creates an empty file (i.e. with no header line)
> ```
> dataset.write().format("com.databricks.spark.csv").option("header", 
> "true").save("... file name here ...");
> or 
> dataset.write().option("header", "true").csv("... file name here ...");
> ```
> The same file then cannot be read back using the same format (i.e. spark-csv), 
> since it is empty, as below. The same call works if the dataset has at least 
> one record.
> ```
> sqlCtx.read().format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load("... file name here ...");
> or 
> sparkSession.read().option("header", "true").option("inferSchema", 
> "true").csv("... file name here ...");
> ```
> This is not right; you should always be able to read back a file that you just 
> wrote.
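
One possible workaround until this is addressed (a sketch only, not verified 
against 2.0.2) is to supply the schema explicitly when reading back, so the reader 
does not depend on a header line or schema inference:

{code}
import org.apache.spark.sql.types._

// Hypothetical schema matching whatever columns were written.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

val readBack = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("... file name here ...")   // path placeholder kept from the report
{code}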






[jira] [Comment Edited] (SPARK-3165) DecisionTree does not use sparsity in data

2017-03-21 Thread Facai Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646
 ] 

Facai Yan edited comment on SPARK-3165 at 3/22/17 1:57 AM:
---

Do you mean that TreePoint.binnedFeatures is an Array[Int], which doesn't take 
advantage of sparsity in the data?

If so, these modifications would be needed:
1. change TreePoint.binnedFeatures to a Vector.
2. modify the LearningNode.predictImpl method if needed.
3. modify the bin-wise computation methods, such as binSeqOp, to 
accelerate the computation.

Please correct me if I misunderstand.

I'd like to work on this if no one else has started it.


was (Author: facai):
Do you mean that:
TreePoint.binnedFeatures is Array[int], which doesn't sparsity in data?

So those modifications is need:
1. modify TreePoint.binnedFeatures to Vector.
2. modify LearningNode.predictImpl method if need.
3. modify the methods about Bin-wise computation, such as binSeqOp, to 
accelerate computation.

Please correct me if misunderstand.

I'd like to work on it.

> DecisionTree does not use sparsity in data
> --
>
> Key: SPARK-3165
> URL: https://issues.apache.org/jira/browse/SPARK-3165
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: computation
> DecisionTree should take advantage of sparse feature vectors.  Aggregation 
> over training data could handle the empty/zero-valued data elements more 
> efficiently.
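
For readers unfamiliar with the distinction, a small illustration of dense versus 
sparse feature vectors using the public vector API (TreePoint itself is internal; 
this is illustration only):

{code}
import org.apache.spark.ml.linalg.Vectors

// The same feature vector stored densely and sparsely; a sparse-aware aggregation
// could skip the zero entries instead of visiting every feature.
val dense  = Vectors.dense(0.0, 0.0, 3.0, 0.0, 0.0)
val sparse = Vectors.sparse(5, Array(2), Array(3.0))   // (size, indices, values)
{code}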






[jira] [Commented] (SPARK-3165) DecisionTree does not use sparsity in data

2017-03-21 Thread Facai Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646
 ] 

Facai Yan commented on SPARK-3165:
--

Do you mean that:
TreePoint.binnedFeatures is Array[int], which doesn't sparsity in data?

So those modifications is need:
1. modify TreePoint.binnedFeatures to Vector.
2. modify LearningNode.predictImpl method if need.
3. modify the methods about Bin-wise computation, such as binSeqOp, to 
accelerate computation.

Please correct me if misunderstand.

I'd like to work on it.

> DecisionTree does not use sparsity in data
> --
>
> Key: SPARK-3165
> URL: https://issues.apache.org/jira/browse/SPARK-3165
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: computation
> DecisionTree should take advantage of sparse feature vectors.  Aggregation 
> over training data could handle the empty/zero-valued data elements more 
> efficiently.






[jira] [Resolved] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20051.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17382
[https://github.com/apache/spark/pull/17382]

> Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
> -
>
> Key: SPARK-20051
> URL: https://issues.apache.org/jira/browse/SPARK-20051
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
> Fix For: 2.2.0
>
>
> There is a race condition between calling stop on a streaming query and 
> deleting directories in withTempDir that causes the test to fail; the fix is to do 
> lazy deletion using a delete-on-shutdown JVM hook.
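
A minimal sketch of the delete-on-shutdown approach described above, using only 
generic JVM and commons-io calls (the actual change is in the linked pull request):

{code}
import java.io.File
import org.apache.commons.io.FileUtils

// Rather than deleting the temp directory while the query may still be stopping,
// register the deletion to run when the JVM exits.
def deleteOnShutdown(dir: File): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = FileUtils.deleteQuietly(dir)
  }))
}
{code}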






[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935629#comment-15935629
 ] 

Xiao Li commented on SPARK-20009:
-

[~marmbrus] Does it sound OK to you?

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we add a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
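
As an illustration of the proposal (not an existing overload), the same schema 
expressed programmatically and as the DDL-style string the ticket would like 
user-facing APIs to accept:

{code}
import org.apache.spark.sql.types._

// Programmatic definition, as most user-facing APIs require today.
val programmatic = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// The user-friendly DDL form proposed here, e.g. for functions.from_json.
val ddl = "a INT, b STRING"
{code}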






[jira] [Updated] (SPARK-20052) Some InputDStream needs closing processing after all batches processed when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-20052:

Description: 
Some class extend InputDStream needs closing processing after processing all 
batches when graceful shutdown enabled.
(e.g. When using Kafka as data source, need to commit processed offsets to 
Kafka Broker)

InputDStream has method 'stop' to stop receiving data, but this method will be 
called before processing last batches generated for graceful shutdown.


  was:
Some class extend InputDStream needs closing processing after all batches 
processed when graceful shutdown enabled.
(e.g. When using Kafka as data source, need to commit processed offsets to 
Kafka Broker)

InputDStream has method 'stop' to stop receiving data, but this method will be 
called before processing last batches generated for graceful shutdown.



> Some InputDStream needs closing processing after all batches processed when 
> graceful shutdown
> -
>
> Key: SPARK-20052
> URL: https://issues.apache.org/jira/browse/SPARK-20052
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> Some classes that extend InputDStream need a closing step after all batches have 
> been processed when graceful shutdown is enabled.
> (e.g. when using Kafka as the data source, processed offsets need to be committed 
> to the Kafka broker)
> InputDStream has a 'stop' method to stop receiving data, but this method is 
> called before the last batches generated for the graceful shutdown are processed.






[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935620#comment-15935620
 ] 

Hyukjin Kwon commented on SPARK-20008:
--

Thank you for your kind explanation. I think you have more insight into this 
issue than I do. Could you fix this?

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 instead of the expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround: for now I check that the counts != 0 before this operation. Not good 
> for performance, hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail: 
> starting from Spark 2, this kind of operation is implemented as a left anti 
> join, instead of using RDD operations directly.
> The same issue also occurs on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is the logic of comparing NULL 
> = NULL (should it return true or false?) that causes this kind of confusion. 
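
A sketch of the count-based guard the reporter describes (hypothetical helper; it 
trades an extra count() for the correct empty result):

{code}
import org.apache.spark.sql.DataFrame

// An empty left side can never produce rows, so skip except() entirely and avoid
// the spurious single-row result described above.
def safeExcept(left: DataFrame, right: DataFrame): DataFrame =
  if (left.count() == 0) left else left.except(right)
{code}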






[jira] [Commented] (SPARK-20054) [Mesos] Detectability for resource starvation

2017-03-21 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935605#comment-15935605
 ] 

Michael Gummelt commented on SPARK-20054:
-

Sounds like this could be solved just by having some better logging?  Something 
that indicates the driver is waiting for more registered executors?

> [Mesos] Detectability for resource starvation
> -
>
> Key: SPARK-20054
> URL: https://issues.apache.org/jira/browse/SPARK-20054
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We 
> had a production issue recently wherein our Spark frameworks accepted 
> resources from the Mesos master, so executors were started and the Spark driver 
> was aware of them, but the driver didn't schedule any tasks and nothing 
> happened for a long time because it didn't meet the minimum registered 
> resources threshold. The cluster is usually under-provisioned 
> because not all the jobs need to run at the same time. These held resources 
> were never offered back to the master for re-allocation, bringing the entire 
> cluster to a halt until we had to intervene manually. 
> We use DRF for Mesos and FIFO for Spark, and the cluster is usually 
> under-provisioned. At any point in time there could be 10-15 Spark frameworks 
> running on Mesos on the under-provisioned cluster. 
> The ask is to have a way to get better recoverability or detectability for a 
> scenario where individual Spark frameworks hold onto resources but never 
> launch any tasks, or to have these frameworks release those resources after a 
> fixed amount of time.
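
For reference, the registration threshold mentioned above is governed by existing 
scheduler settings; a sketch of tuning them so the driver does not sit on accepted 
resources indefinitely before scheduling (values are illustrative only):

{code}
import org.apache.spark.SparkConf

// Illustrative values: wait for 80% of requested executors, but give up waiting
// after 60s so tasks start instead of resources being held idle.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
{code}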






[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935603#comment-15935603
 ] 

Xiao Li commented on SPARK-20008:
-

In a traditional RDBMS, users are not allowed to create a table with zero 
columns, so the existing solution did not cover this case. Do you want to fix it, 
[~hyukjin.kwon], or would you like me to?

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 instead of the expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround: for now I check that the counts != 0 before this operation. Not good 
> for performance, hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail: 
> starting from Spark 2, this kind of operation is implemented as a left anti 
> join, instead of using RDD operations directly.
> The same issue also occurs on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is the logic of comparing NULL 
> = NULL (should it return true or false?) that causes this kind of confusion. 






[jira] [Comment Edited] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935599#comment-15935599
 ] 

Takeshi Yamamuro edited comment on SPARK-20009 at 3/22/17 1:02 AM:
---

I meant we support both a json-format and a new DDL format in existing APIs, as 
you said.
This is like: 
https://github.com/apache/spark/compare/master...maropu:UserDDLForSchema#diff-df78a74ef92d9b8fb4ac142ff9a62464R111


was (Author: maropu):
I meant we support both a json-format and a new DDL format in existing APIs, as 
you said.

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we add a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.






[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935599#comment-15935599
 ] 

Takeshi Yamamuro commented on SPARK-20009:
--

I meant we support both a json-format and a new DDL format in existing APIs, as 
you said.

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we add a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.








[jira] [Resolved] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19919.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17256
[https://github.com/apache/spark/pull/17256]

> Defer input path validation into DataSource in CSV datasource
> -
>
> Key: SPARK-19919
> URL: https://issues.apache.org/jira/browse/SPARK-19919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Currently, if other datasources fail to infer the schema, they return {{None}} 
> and this is then validated in {{DataSource}} as below:
> {code}
> scala> spark.read.json("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.orc("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.parquet("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
> {code}
> However, the CSV datasource checks this within its own implementation and throws 
> a different exception message, as below:
> {code}
> scala> spark.read.csv("emptydir")
> java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
> from an empty set of files
> {code}
> We could remove this duplicated check and validate this in one place in the 
> same way with the same message.






[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19919:
---

Assignee: Hyukjin Kwon

> Defer input path validation into DataSource in CSV datasource
> -
>
> Key: SPARK-19919
> URL: https://issues.apache.org/jira/browse/SPARK-19919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Currently, if other datasources fail to infer the schema, they return {{None}} 
> and this is then validated in {{DataSource}} as below:
> {code}
> scala> spark.read.json("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.orc("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.parquet("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
> {code}
> However, the CSV datasource checks this within its own implementation and throws 
> a different exception message, as below:
> {code}
> scala> spark.read.csv("emptydir")
> java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
> from an empty set of files
> {code}
> We could remove this duplicated check and validate this in one place in the 
> same way with the same message.






[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserves nulls.

2017-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19980:

Fix Version/s: 2.1.1

> Basic Dataset transformation on POJOs does not preserves nulls.
> ---
>
> Key: SPARK-19980
> URL: https://issues.apache.org/jira/browse/SPARK-19980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michel Lemay
>Assignee: Takeshi Yamamuro
> Fix For: 2.1.1, 2.2.0
>
>
> Applying an identity map transformation on a statically typed Dataset with a 
> POJO produces an unexpected result.
> Given POJOs:
> {code}
> public class Stuff implements Serializable {
> private String name;
> public void setName(String name) { this.name = name; }
> public String getName() { return name; }
> }
> public class Outer implements Serializable {
> private String name;
> private Stuff stuff;
> public void setName(String name) { this.name = name; }
> public String getName() { return name; }
> public void setStuff(Stuff stuff) { this.stuff = stuff; }
> public Stuff getStuff() { return stuff; }
> }
> {code}
> Produces the result:
> {code}
> scala> val encoder = Encoders.bean(classOf[Outer])
> encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
> stuff[0]: struct]
> scala> val schema = encoder.schema
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(name,StringType,true), 
> StructField(stuff,StructType(StructField(name,StringType,true)),true))
> scala> schema.printTreeString
> root
>  |-- name: string (nullable = true)
>  |-- stuff: struct (nullable = true)
>  ||-- name: string (nullable = true)
> scala> val df = 
> spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
> df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
> struct]
> scala> df.show()
> ++-+
> |name|stuff|
> ++-+
> |  v1| null|
> ++-+
> scala> df.map(x => x)(encoder).show()
> ++--+
> |name| stuff|
> ++--+
> |  v1|[null]|
> ++--+
> {code}
> After identity transformation, `stuff` becomes an object with null values 
> inside it instead of staying null itself.
> Doing the same with case classes preserves the nulls:
> {code}
> scala> case class ScalaStuff(name: String)
> defined class ScalaStuff
> scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
> defined class ScalaOuter
> scala> val encoder2 = Encoders.product[ScalaOuter]
> encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, 
> stuff[0]: struct]
> scala> val schema2 = encoder2.schema
> schema2: org.apache.spark.sql.types.StructType = 
> StructType(StructField(name,StringType,true), 
> StructField(stuff,StructType(StructField(name,StringType,true)),true))
> scala> schema2.printTreeString
> root
>  |-- name: string (nullable = true)
>  |-- stuff: struct (nullable = true)
>  ||-- name: string (nullable = true)
> scala>
> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
> df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: 
> struct]
> scala> df2.show()
> ++-+
> |name|stuff|
> ++-+
> |  v1| null|
> ++-+
> scala> df2.map(x => x).show()
> ++-+
> |name|stuff|
> ++-+
> |  v1| null|
> ++-+
> {code}
> stuff.json:
> {code}
> {"name":"v1", "stuff":null }
> {code}






[jira] [Created] (SPARK-20054) [Mesos] Detectability for resource starvation

2017-03-21 Thread Kamal Gurala (JIRA)
Kamal Gurala created SPARK-20054:


 Summary: [Mesos] Detectability for resource starvation
 Key: SPARK-20054
 URL: https://issues.apache.org/jira/browse/SPARK-20054
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Scheduler
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Kamal Gurala
Priority: Minor


We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We had a 
production issue recently wherein our Spark frameworks accepted resources from the 
Mesos master, so executors were started and the Spark driver was aware of them, but 
the driver didn't schedule any tasks and nothing happened for a long time because 
it didn't meet the minimum registered resources threshold. The cluster is usually 
under-provisioned because not all the jobs need to run at the same time. These held 
resources were never offered back to the master for re-allocation, bringing the 
entire cluster to a halt until we had to intervene manually. 

We use DRF for Mesos and FIFO for Spark, and the cluster is usually 
under-provisioned. At any point in time there could be 10-15 Spark frameworks 
running on Mesos on the under-provisioned cluster. 

The ask is to have a way to get better recoverability or detectability for a 
scenario where individual Spark frameworks hold onto resources but never launch 
any tasks, or to have these frameworks release those resources after a fixed 
amount of time.






[jira] [Commented] (SPARK-20053) Can't select col when the dot (.) in col name

2017-03-21 Thread Xuxiang Mao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935568#comment-15935568
 ] 

Xuxiang Mao commented on SPARK-20053:
-

This is what my code looks like: 

String cmdOutputFile = "/Downloads/output.csv";

SparkSession spark = SparkSession
    .builder().master("local[*]")
    .appName("PostProcessingBeta")
    .getOrCreate();

Dataset<Row> df = spark.read()
    .option("maxCharsPerColumn", "4096")
    .option("inferSchema", true)
    .option("header", true)
    .option("comment", "#")
    .csv(cmdOutputFile);

df.select("sd_1_2").show();          // succeeds; no "." in the column name
df.select("r_2_shape_1.8").show();   // throws the exception above

> Can't select col when the dot (.) in col name
> -
>
> Key: SPARK-20053
> URL: https://issues.apache.org/jira/browse/SPARK-20053
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: mac OX
>Reporter: Xuxiang Mao
>
> I use the Java API to read a CSV file as a DataFrame and try to do 
> Dataframe.select("column name").show(). This operation succeeds 
> when the column name contains no ".", but it fails when the column name 
> has ".". ERROR:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`r_2_shape_1.8`' given input columns: [z_2.1.1, z_2.1.11,   
> r_1.34.2, r_1.14.2, r_2_shape_1.8, z_1.2.39];;
> 'Project ['r_2_shape_1.8]
> +- TypedFilter 
> com.amazon.recommerce.pricing.forecasting.postProcessing.utils.RawFileUtils$1@a03529c,
>  interface org.apache.spark.sql.Row, [StructField(lp__,IntegerType,true), 
> StructField(b.1,DoubleType,true), 
> StructField(temp_Intercept,DoubleType,true), 
> StructField(b_shape.1,DoubleType,true), StructField(sd_1.1,DoubleType,true), 
> StructField(sd_1_2,DoubleType,true), StructField(z_1.1.1,DoubleType,true), 
> StructField(z_1.2.1,DoubleType,true), StructField(z_1.1.2,DoubleType,true), 
> StructField(z_1.2.2,DoubleType,true), StructField(z_1.1.3,DoubleType,true), 
> StructField(z_1.2.3,DoubleType,true), StructField(z_1.1.4,DoubleType,true), 
> StructField(z_1.2.4,DoubleType,true), StructField(z_1.1.5,DoubleType,true), 
> StructField(z_1.2.5,DoubleType,true), StructField(z_1.1.6,DoubleType,true), 
> StructField(z_1.2.6,DoubleType,true), StructField(z_1.1.7,DoubleType,true), 
> StructField(z_1.2.7,DoubleType,true), StructField(z_1.1.8,DoubleType,true), 
> StructField(z_1.2.8,DoubleType,true), StructField(z_1.1.9,DoubleType,true), 
> StructField(z_1.2.9,DoubleType,true), ... 294 more fields], 
> createexternalrow(lp__#0, b.1#1, temp_Intercept#2, b_shape.1#3, sd_1.1#4, 
> sd_1_2#5, z_1.1.1#6, z_1.2.1#7, z_1.1.2#8, z_1.2.2#9, z_1.1.3#10, z_1.2.3#11, 
> z_1.1.4#12, z_1.2.4#13, z_1.1.5#14, z_1.2.5#15, z_1.1.6#16, z_1.2.6#17, 
> z_1.1.7#18, z_1.2.7#19, z_1.1.8#20, z_1.2.8#21, z_1.1.9#22, z_1.2.9#23, ... 
> 612 more fields)
>+- 
> Relation[lp__#0,b.1#1,temp_Intercept#2,b_shape.1#3,sd_1.1#4,sd_1_2#5,z_1.1.1#6,z_1.2.1#7,z_1.1.2#8,z_1.2.2#9,z_1.1.3#10,z_1.2.3#11,z_1.1.4#12,z_1.2.4#13,z_1.1.5#14,z_1.2.5#15,z_1.1.6#16,z_1.2.6#17,z_1.1.7#18,z_1.2.7#19,z_1.1.8#20,z_1.2.8#21,z_1.1.9#22,z_1.2.9#23,...
>  294 more fields] csv
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at 

[jira] [Created] (SPARK-20053) Can't select col when the dot (.) in col name

2017-03-21 Thread Xuxiang Mao (JIRA)
Xuxiang Mao created SPARK-20053:
---

 Summary: Can't select col when the dot (.) in col name
 Key: SPARK-20053
 URL: https://issues.apache.org/jira/browse/SPARK-20053
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.0
 Environment: mac OX
Reporter: Xuxiang Mao


I use the Java API to read a CSV file as a DataFrame and try to do 
Dataframe.select("column name").show(). This operation succeeds when the column 
name contains no ".", but it fails when the column name has ".". ERROR:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve '`r_2_shape_1.8`' given input columns: [z_2.1.1, z_2.1.11,   
r_1.34.2, r_1.14.2, r_2_shape_1.8, z_1.2.39];;
'Project ['r_2_shape_1.8]
+- TypedFilter 
com.amazon.recommerce.pricing.forecasting.postProcessing.utils.RawFileUtils$1@a03529c,
 interface org.apache.spark.sql.Row, [StructField(lp__,IntegerType,true), 
StructField(b.1,DoubleType,true), StructField(temp_Intercept,DoubleType,true), 
StructField(b_shape.1,DoubleType,true), StructField(sd_1.1,DoubleType,true), 
StructField(sd_1_2,DoubleType,true), StructField(z_1.1.1,DoubleType,true), 
StructField(z_1.2.1,DoubleType,true), StructField(z_1.1.2,DoubleType,true), 
StructField(z_1.2.2,DoubleType,true), StructField(z_1.1.3,DoubleType,true), 
StructField(z_1.2.3,DoubleType,true), StructField(z_1.1.4,DoubleType,true), 
StructField(z_1.2.4,DoubleType,true), StructField(z_1.1.5,DoubleType,true), 
StructField(z_1.2.5,DoubleType,true), StructField(z_1.1.6,DoubleType,true), 
StructField(z_1.2.6,DoubleType,true), StructField(z_1.1.7,DoubleType,true), 
StructField(z_1.2.7,DoubleType,true), StructField(z_1.1.8,DoubleType,true), 
StructField(z_1.2.8,DoubleType,true), StructField(z_1.1.9,DoubleType,true), 
StructField(z_1.2.9,DoubleType,true), ... 294 more fields], 
createexternalrow(lp__#0, b.1#1, temp_Intercept#2, b_shape.1#3, sd_1.1#4, 
sd_1_2#5, z_1.1.1#6, z_1.2.1#7, z_1.1.2#8, z_1.2.2#9, z_1.1.3#10, z_1.2.3#11, 
z_1.1.4#12, z_1.2.4#13, z_1.1.5#14, z_1.2.5#15, z_1.1.6#16, z_1.2.6#17, 
z_1.1.7#18, z_1.2.7#19, z_1.1.8#20, z_1.2.8#21, z_1.1.9#22, z_1.2.9#23, ... 612 
more fields)
   +- 
Relation[lp__#0,b.1#1,temp_Intercept#2,b_shape.1#3,sd_1.1#4,sd_1_2#5,z_1.1.1#6,z_1.2.1#7,z_1.1.2#8,z_1.2.2#9,z_1.1.3#10,z_1.2.3#11,z_1.1.4#12,z_1.2.4#13,z_1.1.5#14,z_1.2.5#15,z_1.1.6#16,z_1.2.6#17,z_1.1.7#18,z_1.2.7#19,z_1.1.8#20,z_1.2.8#21,z_1.1.9#22,z_1.2.9#23,...
 294 more fields] csv

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 

[jira] [Updated] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20051:
-
Description: There is a race condition between calling stop on a streaming 
query and deleting directories in withTempDir that causes test to fail, fixing 
to do lazy deletion using delete on shutdown JVM hook.  (was: There is a race 
condition with deleting directories in withTempDir that causes test to fail, 
fixing to do lazy deletion using delete on shutdown JVM hook.)

> Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
> -
>
> Key: SPARK-20051
> URL: https://issues.apache.org/jira/browse/SPARK-20051
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> There is a race condition between calling stop on a streaming query and 
> deleting directories in withTempDir that causes the test to fail; the fix is to do 
> lazy deletion using a delete-on-shutdown JVM hook.






[jira] [Assigned] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20051:


Assignee: (was: Apache Spark)

> Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
> -
>
> Key: SPARK-20051
> URL: https://issues.apache.org/jira/browse/SPARK-20051
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> There is a race condition with deleting directories in withTempDir that 
> causes the test to fail; the fix is to do lazy deletion using a delete-on-shutdown 
> JVM hook.






[jira] [Assigned] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20051:


Assignee: Apache Spark

> Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
> -
>
> Key: SPARK-20051
> URL: https://issues.apache.org/jira/browse/SPARK-20051
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>
> There is a race condition with deleting directories in withTempDir that 
> causes the test to fail; the fix is to do lazy deletion using a delete-on-shutdown 
> JVM hook.






[jira] [Commented] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935564#comment-15935564
 ] 

Apache Spark commented on SPARK-20051:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17382

> Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
> -
>
> Key: SPARK-20051
> URL: https://issues.apache.org/jira/browse/SPARK-20051
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> There is a race condition with deleting directories in withTempDir that 
> causes the test to fail; the fix is to do lazy deletion using a delete-on-shutdown 
> JVM hook.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20052) Some InputDStream needs closing processing after all batches processed when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)
Sasaki Toru created SPARK-20052:
---

 Summary: Some InputDStream needs closing processing after all 
batches processed when graceful shutdown
 Key: SPARK-20052
 URL: https://issues.apache.org/jira/browse/SPARK-20052
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Sasaki Toru


Some classes that extend InputDStream need closing processing after all batches 
have been processed when graceful shutdown is enabled.
(e.g. when using Kafka as the data source, processed offsets need to be 
committed to the Kafka broker)

InputDStream has a 'stop' method to stop receiving data, but this method is 
called before the last batches generated for graceful shutdown are processed.
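
For illustration only, a minimal sketch of where such a closing hook could sit 
relative to the existing contract; the class and method names below are 
hypothetical and not part of any Spark API.

{code}
import scala.reflect.ClassTag

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical sketch only: 'onAllBatchesProcessed' does not exist in Spark.
abstract class ClosableInputDStream[T: ClassTag](ssc: StreamingContext)
  extends InputDStream[T](ssc) {

  // Existing contract: stop() is called before the final batches generated for
  // graceful shutdown have been processed.
  // Hypothetical addition: a hook invoked after the last generated batch has
  // completed, where e.g. a Kafka-backed stream could commit its final offsets.
  def onAllBatchesProcessed(): Unit = ()
}
{code}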




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException

2017-03-21 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-20051:


 Summary: Fix StreamSuite.recover from v2.1 checkpoint failing with 
IOException
 Key: SPARK-20051
 URL: https://issues.apache.org/jira/browse/SPARK-20051
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Kunal Khamar


There is a race condition with deleting directories in withTempDir that causes 
the test to fail; the fix is to do lazy deletion using a JVM shutdown hook.
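
A minimal sketch of what "lazy deletion using a JVM shutdown hook" can look like 
in general; this is illustrative only and not the actual change made to the test 
suite.

{code}
import java.io.File

import org.apache.commons.io.FileUtils

// Instead of deleting the temp directory immediately (racing with a query that
// is still shutting down), defer the cleanup until the JVM exits.
def deleteOnShutdown(dir: File): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = FileUtils.deleteQuietly(dir)
  }))
}
{code}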



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935540#comment-15935540
 ] 

Hyukjin Kwon commented on SPARK-20008:
--

[~smilegator], it seems the discussion there is about duplicates in the result, 
if I understood correctly.

The problem here is that {{Set() - Set()}} should return an empty {{Set()}}, 
which was previously the case. However, it now seems to return {{Set(Row())}} 
for empty dataframes.

In the current master,

{code}
scala> spark.emptyDataFrame.except(spark.emptyDataFrame).collect()
res0: Array[org.apache.spark.sql.Row] = Array([])

scala> spark.emptyDataFrame.collect()
res1: Array[org.apache.spark.sql.Row] = Array()
{code}

I thought S ∖ S = ∅, as below:

{code}
scala> spark.range(1).except(spark.range(1)).collect()
res14: Array[Long] = Array()
{code}
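
The count of 1 in the original report is then just the size of that 
single-element {{Array([])}} above:

{code}
scala> spark.emptyDataFrame.except(spark.emptyDataFrame).count()   // returns 1, as reported
{code}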


> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against expected 0.
> This was not the case with Spark 1.5.2. This is an API change from the usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround - for now I check that the counts are != 0 before this operation. 
> Not good for performance, hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail: 
> Starting from Spark 2, these kinds of operations are implemented as a left 
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is likely the logic of 
> comparing NULL = NULL (should it return true or false?) that causes this kind 
> of confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20047) Constrained Logistic Regression

2017-03-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20047:

Description: 
For certain applications, such as stacked regressions, it is important to put 
non-negativity constraints on the regression coefficients. Also, if the ranges 
of the coefficients are known, it makes sense to constrain the coefficient 
search space.

Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ 
R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a set 
of m linear constraints on the coefficients, is very challenging, as discussed 
in the literature. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, one can use projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B, can handle large-scale box-constrained optimization 
efficiently. Unfortunately, for OWLQN there is no efficient way to optimize 
with box constraints.

As a result, in this work we only implement constrained LR with box constraints 
and without L1 regularization. 

Note that since we standardize the data in the training phase, the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constraints into the scaled space.

Users will be able to set the lower / upper bounds of each coefficient and of 
the intercepts.
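
For illustration, a minimal sketch of the bound conversion into the scaled space 
described above, under the assumption that the optimizer works on coefficients 
multiplied by each feature's standard deviation; the names are illustrative, not 
the actual implementation.

{code}
// Sketch only: convert user-supplied box constraints into the scaled space,
// assuming the coefficient seen by the optimizer is beta(j) * featureStd(j).
def toScaledBounds(
    lower: Array[Double],
    upper: Array[Double],
    featureStd: Array[Double]): (Array[Double], Array[Double]) = {
  val scaledLower = lower.indices.map(j => lower(j) * featureStd(j)).toArray
  val scaledUpper = upper.indices.map(j => upper(j) * featureStd(j)).toArray
  (scaledLower, scaledUpper)
}
{code}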

  was:
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a 
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.






> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a 
> set of m linear constraints on the coefficients is very challenging as 
> discussed in many literatures. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, people can projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B can handle large scale box optimization efficiently. 
> Unfortunately, for OWLQN, there is no good efficient way to do optimization 
> with box constrains.
> As a result, in this work, we only implement constrained LR with box 
> constrains without L1 regularization. 
> Note that since we standardize the data in training phase, so the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constrains into scaled space.
> Users will be able to set the lower / upper bounds of each coefficients and 
> intercepts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-20047) Constrained Logistic Regression

2017-03-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20047:

Description: 
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a 
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.





  was:
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^\{m×p\} and b ∈ R^{m} are predefined matrices and vectors which places a
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.






> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a 
> set of m linear constraints on the coefficients is very challenging as 
> discussed in many literatures. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, people can projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B can handle large scale box optimization efficiently. 
> Unfortunately, for OWLQN, there is no good efficient way to do optimization 
> with box constrains.
> As a result, in this work, we only implement constrained LR with box 
> constrains without L1 regularization. 
> Note that since we standardize the data in training phase, so the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constrains into scaled space.
> 

[jira] [Updated] (SPARK-20047) Constrained Logistic Regression

2017-03-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20047:

Description: 
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^\{m×p\} and b ∈ R^{m} are predefined matrices and vectors which places a
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.





  was:
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^{m×p} and b ∈ R^{m} are predefined matrices and vectors which places a
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.






> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^{m} are predefined matrices and vectors which places a
> set of m linear constraints on the coefficients is very challenging as 
> discussed in many literatures. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, people can projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B can handle large scale box optimization efficiently. 
> Unfortunately, for OWLQN, there is no good efficient way to do optimization 
> with box constrains.
> As a result, in this work, we only implement constrained LR with box 
> constrains without L1 regularization. 
> Note that since we standardize the data in training phase, so the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constrains into scaled space.
> Users 

[jira] [Updated] (SPARK-20047) Constrained Logistic Regression

2017-03-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20047:

Description: 
For certain applications, such as stacked regressions, it is important to put 
non-negative constraints on the regression coefficients. Also, if the ranges of 
coefficients are known, it makes sense to constrain the coefficient search 
space.

Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
R^{m×p} and b ∈ R^{m} are predefined matrices and vectors which places a
set of m linear constraints on the coefficients is very challenging as 
discussed in many literatures. 

However, for box constraints on the coefficients, the optimization is well 
solved. For gradient descent, people can projected gradient descent in the 
primal by zeroing the negative weights at each step. For LBFGS, an extended 
version of it, LBFGS-B can handle large scale box optimization efficiently. 
Unfortunately, for OWLQN, there is no good efficient way to do optimization 
with box constrains.

As a result, in this work, we only implement constrained LR with box constrains 
without L1 regularization. 

Note that since we standardize the data in training phase, so the coefficients 
seen in the optimization subroutine are in the scaled space; as a result, we 
need to convert the box constrains into scaled space.

Users will be able to set the lower / upper bounds of each coefficients and 
intercepts.


 


One solution could be to modify these implementations and do a Projected 
Gradient Descent in the primal by zeroing the negative weights at each step. 
But this process is inconvenient because the nice convergence properties are 
then lost.





> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ 
> R^{m×p} and b ∈ R^{m} are predefined matrices and vectors which places a
> set of m linear constraints on the coefficients is very challenging as 
> discussed in many literatures. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, people can projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B can handle large scale box optimization efficiently. 
> Unfortunately, for OWLQN, there is no good efficient way to do optimization 
> with box constrains.
> As a result, in this work, we only implement constrained LR with box 
> constrains without L1 regularization. 
> Note that since we standardize the data in training phase, so the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constrains into scaled space.
> Users will be able to set the lower / upper bounds of each coefficients and 
> intercepts.
>  
> One solution could be to modify these implementations and do a Projected 
> Gradient Descent in the primal by zeroing the negative weights at each step. 
> But this process is inconvenient because the nice convergence properties are 
> then lost.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-20050:

Description: 
I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below:

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString +
    " offset: " + input.offset.toString
}.foreachRDD { rdd =>
  rdd.foreach { input =>
    println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records that were processed in the last batch before a graceful Streaming 
shutdown are reprocessed in the first batch after Spark Streaming restarts.

This may be because offsets specified in commitAsync are only committed at the 
head of the next batch.
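
For context, a minimal sketch of the graceful shutdown call this refers to, 
assuming a running StreamingContext named ssc:

{code}
// Stop gracefully so that already-generated batches are processed before
// shutdown; the problem above is the offset commit for that final batch.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}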


  was:
I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and 
call 'DirectKafkaInputDStream#commitAsync' finally in each batches such below

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: 
" + input.offset.toString
  }.foreachRDD { rdd =>
rdd.foreach { input =>
println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records which processed in the last batch before Streaming graceful 
shutdown reprocess in the first batch after Spark Streaming restart.

It may cause offsets specified in commitAsync will commit in the head of next 
batch.


 Issue Type: Bug  (was: Improvement)

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' finally in each batches,  such 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which processed in the last batch before Streaming graceful 
> shutdown reprocess in the first batch after Spark Streaming restart.
> It may cause offsets specified in commitAsync will commit in the head of next 
> batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-21 Thread Sasaki Toru (JIRA)
Sasaki Toru created SPARK-20050:
---

 Summary: Kafka 0.10 DirectStream doesn't commit last processed 
batch's offset when graceful shutdown
 Key: SPARK-20050
 URL: https://issues.apache.org/jira/browse/SPARK-20050
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Sasaki Toru


I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below:

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString +
    " offset: " + input.offset.toString
}.foreachRDD { rdd =>
  rdd.foreach { input =>
    println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records that were processed in the last batch before a graceful Streaming 
shutdown are reprocessed in the first batch after Spark Streaming restarts.

This may be because offsets specified in commitAsync are only committed at the 
head of the next batch.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20023:


Assignee: Apache Spark  (was: Xiao Li)

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Apache Spark
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implementation mentioned above, the table comment is removed from 
> the properties, so users cannot see the table comment by running "describe 
> formatted table". Similarly, when a user alters the table comment, they still 
> cannot see the change by running "describe formatted table".
> I wonder why we removed table comments; is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20023:


Assignee: Xiao Li  (was: Apache Spark)

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implement mentioned above, it remove table comment from 
> properties, so user can not see table comment through run "describe formatted 
> table". Similarly, when user alters table comment, he still can not see the 
> change of table comment through run "describe formatted table".
> I wonder why we removed table comments, is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935507#comment-15935507
 ] 

Apache Spark commented on SPARK-20023:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17381

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implement mentioned above, it remove table comment from 
> properties, so user can not see table comment through run "describe formatted 
> table". Similarly, when user alters table comment, he still can not see the 
> change of table comment through run "describe formatted table".
> I wonder why we removed table comments, is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-21 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-19408:
---
Target Version/s: 2.3.0  (was: 2.2.0)

> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>
> In SPARK-17075, we estimate cardinality of predicate expression "column (op) 
> literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns such as "column-1 (op) column-2" 
> where column-1 and column-2 belong to same table.  Note that, if column-1 and 
> column-2 belong to different tables, then it is a join operator's work, NOT a 
> filter operator's work.
> In this jira, we want to estimate the filter factor of predicate expressions 
> involving two columns of same table.   
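
For intuition, the kind of textbook-style defaults often used for such 
predicates; these are illustrative only and not necessarily the estimates this 
ticket will implement.

{code}
// Illustrative, textbook-style selectivity defaults for predicates over two
// columns of the same table (ndv = number of distinct values).
def equalitySelectivity(ndv1: Long, ndv2: Long): Double =
  1.0 / math.max(ndv1, ndv2).toDouble

// With no overlap information, range predicates (c1 < c2, c1 <= c2, ...) often
// fall back to a fixed default such as 1/3.
val defaultRangeSelectivity: Double = 1.0 / 3.0
{code}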



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20004) Spark thrift server overwrites spark.app.name

2017-03-21 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935472#comment-15935472
 ] 

Bo Meng edited comment on SPARK-20004 at 3/21/17 10:32 PM:
---

I think you can still use --name for your app name. For example: 
/spark/sbin/start-thriftserver.sh --name="My server 1"


was (Author: bomeng):
I think you can still use --name for your app name. for example, 
/spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host 
--conf spark.app.name="ODBC server $host"

> Spark thrift server overwrites spark.app.name
> 
>
> Key: SPARK-20004
> URL: https://issues.apache.org/jira/browse/SPARK-20004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> {code}
> export SPARK_YARN_APP_NAME="ODBC server $host"
> /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host 
> --conf spark.app.name="ODBC server $host"
> {code}
> And spark-defaults.conf contains: 
> {code}
> spark.app.name "ODBC server spark01"
> {code}
> Still name in yarn is "Thrift JDBC/ODBC Server"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20004) Spark thrift server overwrites spark.app.name

2017-03-21 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935472#comment-15935472
 ] 

Bo Meng commented on SPARK-20004:
-

I think you can still use --name for your app name. for example, 
/spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host 
--conf spark.app.name="ODBC server $host"

> Spark thrift server overwrites spark.app.name
> 
>
> Key: SPARK-20004
> URL: https://issues.apache.org/jira/browse/SPARK-20004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> {code}
> export SPARK_YARN_APP_NAME="ODBC server $host"
> /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host 
> --conf spark.app.name="ODBC server $host"
> {code}
> And spark-defaults.conf contains: 
> {code}
> spark.app.name "ODBC server spark01"
> {code}
> Still name in yarn is "Thrift JDBC/ODBC Server"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-21 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-20049:
-

 Summary: Writing data to Parquet with partitions takes very long 
after the job finishes
 Key: SPARK-20049
 URL: https://issues.apache.org/jira/browse/SPARK-20049
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, PySpark, SQL
Affects Versions: 2.1.0
 Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
GNU/Linux 8.7 (jessie)
Reporter: Jakub Nowacki


I was testing writing a DataFrame to partitioned Parquet files. The command is 
quite straightforward, and the data set is really a sample of a larger Parquet 
data set; the job is run in PySpark on YARN and written to HDFS:
{code}
# there is column 'date' in df
df.write.partitionBy("date").parquet("dest_dir")
{code}
The reading part took as long as usual, but after the job had been marked as 
finished in PySpark and in the UI, the Python interpreter was still showing it 
as busy. Indeed, when I checked the HDFS folder I noticed that files were still 
being moved from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
folders. 

First of all, this takes much longer than saving the same data set without 
partitioning. Second, it is done in the background, with no visible progress of 
any kind. 
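
A commonly suggested mitigation for the slow move out of {{dest_dir/_temporary}} 
is the version 2 file output committer algorithm, which lets tasks publish their 
own output at task commit time; this is a sketch only, not verified against this 
particular setup, and it assumes the Hadoop version in use supports algorithm 
version 2.

{code}
// Sketch only (Scala; the report itself uses PySpark). Assumes a DataFrame df
// with a 'date' column, as in the report. With the v2 committer, tasks move
// their own files, so the single-threaded rename out of _temporary at job
// commit is largely avoided.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

df.write.partitionBy("date").parquet("dest_dir")
{code}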



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2017-03-21 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409
 ] 

Irina Truong edited comment on SPARK-4296 at 3/21/17 10:01 PM:
---

I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF. This is how it's registered:


{noformat}
sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')
{noformat}


And this is how it's called:

{noformat}
ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"
{noformat}


was (Author: irinatruong):
I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF. This is how it's registered:

{noformat}
sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')
{noformat}

And this is how it's called:

{noformat}
ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"
{noformat}

> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.1, 1.3.0
>
>
> When the input data has a complex structure, using same expression in group 
> by clause and  select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
> expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2017-03-21 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409
 ] 

Irina Truong edited comment on SPARK-4296 at 3/21/17 9:59 PM:
--

I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF. This is how it's registered:

{noformat}
sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')
{noformat}

And this is how it's called:

{noformat}
ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"
{noformat}


was (Author: irinatruong):
I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF:

sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')

ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"




> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.1, 1.3.0
>
>
> When the input data has a complex structure, using same expression in group 
> by clause and  select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
> expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2017-03-21 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409
 ] 

Irina Truong commented on SPARK-4296:
-

I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF:

sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')

ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"




> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.1, 1.3.0
>
>
> When the input data has a complex structure, using same expression in group 
> by clause and  select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
> expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit

2017-03-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-19237:
-

Assignee: Felix Cheung

> SparkR package on Windows waiting for a long time when no java is found 
> launching spark-submit
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> When installing SparkR as an R package (install.packages) on Windows, it will 
> check for a Spark distribution and automatically download and cache it. But if 
> there is no Java runtime on the machine, spark-submit will just hang.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit

2017-03-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19237.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16596
[https://github.com/apache/spark/pull/16596]

> SparkR package on Windows waiting for a long time when no java is found 
> launching spark-submit
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> When installing SparkR as an R package (install.packages) on Windows, it will 
> check for a Spark distribution and automatically download and cache it. But if 
> there is no Java runtime on the machine, spark-submit will just hang.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20048) Cloning SessionState does not clone query execution listeners

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20048:


Assignee: Apache Spark

> Cloning SessionState does not clone query execution listeners
> -
>
> Key: SPARK-20048
> URL: https://issues.apache.org/jira/browse/SPARK-20048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20048) Cloning SessionState does not clone query execution listeners

2017-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935355#comment-15935355
 ] 

Apache Spark commented on SPARK-20048:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17379

> Cloning SessionState does not clone query execution listeners
> -
>
> Key: SPARK-20048
> URL: https://issues.apache.org/jira/browse/SPARK-20048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20048) Cloning SessionState does not clone query execution listeners

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20048:


Assignee: (was: Apache Spark)

> Cloning SessionState does not clone query execution listeners
> -
>
> Key: SPARK-20048
> URL: https://issues.apache.org/jira/browse/SPARK-20048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20048) Cloning SessionState does not clone query execution listeners

2017-03-21 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-20048:


 Summary: Cloning SessionState does not clone query execution 
listeners
 Key: SPARK-20048
 URL: https://issues.apache.org/jira/browse/SPARK-20048
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kunal Khamar






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20023:
---

Assignee: Xiao Li

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implement mentioned above, it remove table comment from 
> properties, so user can not see table comment through run "describe formatted 
> table". Similarly, when user alters table comment, he still can not see the 
> change of table comment through run "describe formatted table".
> I wonder why we removed table comments, is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935320#comment-15935320
 ] 

Xiao Li commented on SPARK-20023:
-

{{DESC EXTENDED}} works. Obviously, {{DESC FORMATTED}} has a bug
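
A minimal repro sketch consistent with this, assuming a Hive-enabled session 
named spark; the behaviour noted in the comments is as described in this ticket 
and has not been re-verified here.

{code}
spark.sql("CREATE TABLE t_comment (id INT) COMMENT 'a table level comment'")
// Per this ticket: the comment shows up with EXTENDED but not with FORMATTED.
spark.sql("DESC EXTENDED t_comment").show(100, truncate = false)
spark.sql("DESC FORMATTED t_comment").show(100, truncate = false)
{code}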

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>
> Spark 2.x implements CREATE TABLE by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But the implementation mentioned above removes the table comment from the table 
> properties, so users cannot see the table comment by running "describe formatted 
> table". Similarly, when a user alters the table comment, the change is still not 
> visible when running "describe formatted table".
> I wonder why the table comment was removed; is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-21 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935304#comment-15935304
 ] 

Miao Wang commented on SPARK-19634:
---

Comments never arrive in my email inbox. [~timhunter] I can continue with your code. 
Let me check it out. Thanks!


> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20047) Constrained Logistic Regression

2017-03-21 Thread DB Tsai (JIRA)
DB Tsai created SPARK-20047:
---

 Summary: Constrained Logistic Regression
 Key: SPARK-20047
 URL: https://issues.apache.org/jira/browse/SPARK-20047
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.0
Reporter: DB Tsai
Assignee: Yanbo Liang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935278#comment-15935278
 ] 

Xiao Li commented on SPARK-20008:
-

See the discussion https://github.com/apache/spark/pull/12736#r61344182

The behavior of the previous EXCEPT is wrong. 

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against the expected 0.
> This was not the case with Spark 1.5.2. From a usage point of view this is an 
> API behavior change, and hence I consider it a bug. It may be a boundary case; 
> I am not sure.
> Workaround: for now I check that the counts are != 0 before this operation, 
> which is not good for performance. Hence I am creating a JIRA to track it.
> As Young Zhang explained in reply to my mail:
> Starting from Spark 2, this kind of operation is implemented as a left anti 
> join, instead of using RDD operations directly.
> The same issue also occurs with sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is linked to the logic of 
> comparing NULL = NULL (should it return true or false?), which causes this 
> kind of confusion.
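
For reference, a minimal sketch of the reported behavior (expected count 0, observed 1 on the affected versions):

{code}
// EXCEPT on two empty DataFrames should yield an empty result; the report
// above observes a count of 1 instead.
val left = spark.emptyDataFrame
val right = spark.emptyDataFrame
val diff = left.except(right)
println(diff.count())   // expected: 0
diff.explain(true)      // shows the BroadcastNestedLoopJoin LeftAnti plan above
{code}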



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935264#comment-15935264
 ] 

Xiao Li commented on SPARK-20009:
-

Are you suggesting changing the semantics of the parameter of the external API?

We cannot break the existing one. Maybe we can support both: try to 
detect whether the string is in the JSON format and, if not, try to parse it as the 
DDL format?
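
A rough sketch of that detect-and-fall-back idea, assuming the DDL entry point is exposed as {{StructType.fromDDL}} (available in newer Spark versions); the helper itself is hypothetical, not an existing API:

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper: keep accepting JSON schema strings for backward
// compatibility, and only treat the input as DDL when JSON parsing fails.
def parseSchemaString(schemaString: String): StructType = {
  try {
    DataType.fromJson(schemaString).asInstanceOf[StructType]
  } catch {
    case _: Exception => StructType.fromDDL(schemaString)
  }
}
{code}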

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use the DDL 
> format in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2017-03-21 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935223#comment-15935223
 ] 

Alex Bozarth commented on SPARK-20044:
--

I like this idea in theory, but I worry it would take a large, sweeping code 
change to work. If you already have an implementation idea, I would suggest opening 
a PR. For me, accepting this would hinge on how it's implemented; I'd rather 
not add lots of new code across the entire web UI.

[~vanzin] and [~tgraves], what do you guys think? You helped review the reverse 
proxy PR.

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Oliver Koeth
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1 and 
> www.mydomain.com/cluster2, and it allows running the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface
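
For context, a sketch of the two settings from SPARK-15487 that this proposal builds on (the values are illustrative only):

{code}
// Route UI requests through the master and tell Spark about the front-end
// proxy; this ticket proposes honoring the path prefix in reverseProxyUrl
// throughout UI URL generation, not just in the worker-to-master back-link.
val conf = new org.apache.spark.SparkConf()
  .set("spark.ui.reverseProxy", "true")
  .set("spark.ui.reverseProxyUrl", "https://www.mydomain.com/cluster1")
{code}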



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20046:


Assignee: Apache Spark

> Facilitate loop optimizations in a JIT compiler regarding 
> sqlContext.read.parquet()
> ---
>
> Key: SPARK-20046
> URL: https://issues.apache.org/jira/browse/SPARK-20046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} 
> to facilitate loop optimizations in a JIT compiler for achieving better 
> performance. In particular, [this stackoverflow 
> entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
>  suggests improving the performance of 
> {{sqlContext.read.parquet("file").count}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20046:


Assignee: (was: Apache Spark)

> Facilitate loop optimizations in a JIT compiler regarding 
> sqlContext.read.parquet()
> ---
>
> Key: SPARK-20046
> URL: https://issues.apache.org/jira/browse/SPARK-20046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} 
> to facilitate loop optimizations in a JIT compiler for achieving better 
> performance. In particular, [this stackoverflow 
> entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
>  suggests improving the performance of 
> {{sqlContext.read.parquet("file").count}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935162#comment-15935162
 ] 

Apache Spark commented on SPARK-20046:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17378

> Facilitate loop optimizations in a JIT compiler regarding 
> sqlContext.read.parquet()
> ---
>
> Key: SPARK-20046
> URL: https://issues.apache.org/jira/browse/SPARK-20046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} 
> to facilitate loop optimizations in a JIT compiler for achieving better 
> performance. In particular, [this stackoverflow 
> entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
>  suggests improving the performance of 
> {{sqlContext.read.parquet("file").count}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20046:
-
Issue Type: Improvement  (was: Bug)

> Facilitate loop optimizations in a JIT compiler regarding 
> sqlContext.read.parquet()
> ---
>
> Key: SPARK-20046
> URL: https://issues.apache.org/jira/browse/SPARK-20046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} 
> to facilitate loop optimizations in a JIT compiler for achieving better 
> performance. In particular, [this stackoverflow 
> entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
>  suggests improving the performance of 
> {{sqlContext.read.parquet("file").count}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20046:


 Summary: Facilitate loop optimizations in a JIT compiler regarding 
sqlContext.read.parquet()
 Key: SPARK-20046
 URL: https://issues.apache.org/jira/browse/SPARK-20046
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


[This 
article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
 suggests that better generated code can improve performance by facilitating 
compiler optimizations.
This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} to 
facilitate loop optimizations in a JIT compiler for achieving better 
performance. In particular, [this stackoverflow 
entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
 suggests improving the performance of 
{{sqlContext.read.parquet("file").count}}.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935112#comment-15935112
 ] 

Seth Hendrickson commented on SPARK-17136:
--

The reason to support setting them in both places would be backwards 
compatibility mainly. If we still allow users to set {{maxIter}} on the 
estimator then we won't break code that previously did this. Specifying the 
optimizer, either one built into Spark or a custom one, would be optional and 
something mostly advanced users would do. About grid-based CV, this would be a 
point that we need to carefully consider and make sure that we get it right. 
We'd still allow users to search over grids of {{maxIter}}, {{tol}} etc... 
since those params are still there, but additionally users could search over 
different optimizers and optimizers with different parameters themselves. I 
think that could be a bit clunky, but it's open for design discussion. e.g.

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.minimizer, Array(new LBFGS(), new OWLQN(), new LBFGSB(lb, ub)))
  .build()
{code}

Yes, there are cases where users could supply conflicting grids, but AFAICT 
this problem already exists, e.g. 

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.solver, Array("normal", "l-bfgs"))
  .addGrid(lr.maxIter, Array(10, 20)) // maxIter is ignored when solver is normal
  .build()
{code}

About your suggestion of mimicking Spark SQL - would you mind elaborating here 
or on the design doc? I'm not as familiar with it, so if you have some design 
in mind it would be great to hear that.



> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20017) Functions "str_to_map" and "explode" throws NPE exceptioin

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20017:

Labels:   (was: correctness)

> Functions "str_to_map" and "explode" throws NPE exceptioin
> --
>
> Key: SPARK-20017
> URL: https://issues.apache.org/jira/browse/SPARK-20017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: roncenzhao
>Assignee: roncenzhao
> Fix For: 2.1.1, 2.2.0
>
> Attachments: screenshot-1.png
>
>
> {code}
> val sqlDf = spark.sql("select k,v from (select str_to_map('') as map_col from 
> range(2)) tbl lateral view explode(map_col) as k,v")
> sqlDf.show
> {code}
> The SQL throws an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20017) Functions "str_to_map" and "explode" throws NPE exceptioin

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20017.
-
   Resolution: Fixed
 Assignee: roncenzhao
Fix Version/s: 2.2.0
   2.1.1

> Functions "str_to_map" and "explode" throws NPE exceptioin
> --
>
> Key: SPARK-20017
> URL: https://issues.apache.org/jira/browse/SPARK-20017
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: roncenzhao
>Assignee: roncenzhao
>  Labels: correctness
> Fix For: 2.1.1, 2.2.0
>
> Attachments: screenshot-1.png
>
>
> {code}
> val sqlDf = spark.sql("select k,v from (select str_to_map('') as map_col from 
> range(2)) tbl lateral view explode(map_col) as k,v")
> sqlDf.show
> {code}
> The SQL throws an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20016) SparkLauncher submit job failed after setConf with special charaters under windows

2017-03-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935005#comment-15935005
 ] 

Marcelo Vanzin commented on SPARK-20016:


This was a long time ago and mostly trial & error, since Windows batch files 
make no sense. Since I don't really have a Windows test env anymore, I'd 
appreciate it if someone who does have one could try things out.

> SparkLauncher submit job failed after setConf with special charaters under 
> windows
> --
>
> Key: SPARK-20016
> URL: https://issues.apache.org/jira/browse/SPARK-20016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 7, 8, 10, 2008, 2008R2, etc.
>Reporter: Vincent Sun
>
> I am using the SparkLauncher Java API to submit a job to a remote Spark cluster 
> master. The code looks like the following:
> /*
> * launch Job
> */
> public static void launch() throws Exception {
> SparkLauncher spark = new SparkLauncher();
> spark.setAppName("sparkdemo").setAppResource("hdfs://10.250.1.121:9000/application.jar").setMainClass("test.Application");
> spark.setMaster("spark://10.250.1.120:6066");
> spark.setDeployMode("cluster");
> spark.setConf("spark.executor.cores", "2");
> spark.setConf("spark.executor.memory", "8G");
> spark.startApplication(new MyAppListener(job.getAppName()));
>   }
> It works fine under Linux/CentOS, but fails on my own desktop, which runs 
> Windows 8. It throws the error:
> [launcher-proc-1] The filename, directory name, or volume label syntax is 
> incorrect.
> The final command I captured is this:
> spark-submit.cmd  --master spark://10.250.1.120:6066 --deploy-mode cluster 
> --name sparkdemo --conf "spark.executor.memory=8G" --conf 
> "spark.executor.cores=2"  --class test.Application 
> hdfs://10.250.1.121:9000/application.jar
> The quotes around spark.executor.memory=8G and spark.executor.cores=2 cause the 
> exception.
> After debugging into the source code, I found the cause in the 
> quoteForBatchScript method of the CommandBuilderUtils class.
> It adds quotes when the argument contains '=' or certain other special 
> characters under Windows. Here is the source code:
> static String quoteForBatchScript(String arg) {
> boolean needsQuotes = false;
> for (int i = 0; i < arg.length(); i++) {
>   int c = arg.codePointAt(i);
>   if (Character.isWhitespace(c) || c == '"' || c == '=' || c == ',' || c == ';') {
> needsQuotes = true;
> break;
>   }
> }
> if (!needsQuotes) {
>   return arg;
> }
> StringBuilder quoted = new StringBuilder();
> quoted.append("\"");
> for (int i = 0; i < arg.length(); i++) {
>   int cp = arg.codePointAt(i);
>   switch (cp) {
>   case '"':
> quoted.append('"');
> break;
>   default:
> break;
>   }
>   quoted.appendCodePoint(cp);
> }
> if (arg.codePointAt(arg.length() - 1) == '\\') {
>   quoted.append("\\");
> }
> quoted.append("\"");
> return quoted.toString();
>   }



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20039) Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest

2017-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20039:
--
Priority: Minor  (was: Major)

> Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest
> -
>
> Key: SPARK-20039
> URL: https://issues.apache.org/jira/browse/SPARK-20039
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.2.0
>
>
> I realized that since {{ChiSquare}} is in the package {{stat}}, it's pretty 
> unclear if it's the hypothesis test, distribution, or what.  I plan to rename 
> it to {{ChiSquareTest}} to clarify this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20039) Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest

2017-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20039.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17368
[https://github.com/apache/spark/pull/17368]

> Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest
> -
>
> Key: SPARK-20039
> URL: https://issues.apache.org/jira/browse/SPARK-20039
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> I realized that since {{ChiSquare}} is in the package {{stat}}, it's pretty 
> unclear if it's the hypothesis test, distribution, or what.  I plan to rename 
> it to {{ChiSquareTest}} to clarify this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2017-03-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934960#comment-15934960
 ] 

Seth Hendrickson commented on SPARK-7129:
-

I don't think anyone is working on it. Though I'm afraid it is probably not a 
good use of time to spend on this task, for a couple of reasons. We still don't 
have weight support in trees and there is extremely limited bandwidth of 
reviewers/committers in Spark ML at the moment. Further, there are many more 
important tasks that need to be done in ML so I would rate this as low 
priority, which also means it is less likely to be reviewed or see much 
progress. Finally, given the recent success of things like xgboost/lightGBM, we 
may want to rethink/rewrite the existing boosting framework to see if we can 
get similar performance. If anything, I think we need to think about how we'd 
like to proceed improving the boosting libraries in Spark from an overall point 
of view, but that is a large task that is likely a few releases away. I'd be 
curious to hear others' thoughts of course, but this is the state of things 
AFAIK. I guess I don't see this as a priority, but it could become one given 
enough community interest.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17121) Support _HOST replacement for principal

2017-03-21 Thread Chris Gianelloni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934908#comment-15934908
 ] 

Chris Gianelloni commented on SPARK-17121:
--

I find this useful when configuring the Spark HistoryServer to write to a 
Kerberos-enabled HDFS.

> Support _HOST replacement for principal
> ---
>
> Key: SPARK-17121
> URL: https://issues.apache.org/jira/browse/SPARK-17121
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> _HOST is a placeholder for the host name that is widely used for Hadoop 
> components (like NN/DN/RM/NM, etc.); this is useful for the automatic 
> configuration done by some cluster deployment tools. It would be nice if Spark 
> also supported this; it is especially useful for the Spark Thrift Server.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934905#comment-15934905
 ] 

Nick Pentreath commented on SPARK-20043:


I just noticed the error message you put above says "Entorpy" - is that a 
spelling mistake in the JIRA description or in your code?

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entorpy) not recognized".
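
For reference, a sketch of the setup being described, using a DecisionTreeClassifier as an assumed example; passing the impurity values in lowercase appears to be the way to avoid the reported load error:

{code}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.tuning.ParamGridBuilder

val dt = new DecisionTreeClassifier()
// Lowercase values reportedly round-trip through save/load;
// "Gini"/"Entropy" are rejected by the loader.
val grid = new ParamGridBuilder()
  .addGrid(dt.impurity, Array("gini", "entropy"))
  .build()
{code}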



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Docs Text:   (was: I saved a CrossValidatorModel with a decision tree and a 
random forest. I use Paramgrid to test "gini" and "entropy" impurity. 
CrossValidatorModel are not able to load the saved model, when impurity are 
written not in lowercase. I obtain an error from Spark "impurity Gini (Entorpy) 
not recognized.)

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entorpy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Description: 
I saved a CrossValidatorModel with a decision tree and a random forest. I use 
Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
able to load the saved model, when impurity are written not in lowercase. I 
obtain an error from Spark "impurity Gini (Entorpy) not recognized.


> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entorpy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19934) code comments are not very clearly in BlackListTracker.scala

2017-03-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934869#comment-15934869
 ] 

Imran Rashid commented on SPARK-19934:
--

Technically, you are right that "another" isn't really correct; it depends on 
the configs ... but I think this is a pretty insignificant change. The comment 
is more about the "why" than the exact logic, which is best described by the 
code anyhow.

> code comments are not very clearly in BlackListTracker.scala
> 
>
> Key: SPARK-19934
> URL: https://issues.apache.org/jira/browse/SPARK-19934
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Trivial
>
> {code}
> def handleRemovedExecutor(executorId: String): Unit = {
> // We intentionally do not clean up executors that are already 
> blacklisted in
> // nodeToBlacklistedExecs, so that if another executor on the same node 
> gets blacklisted, we can
> // blacklist the entire node.  We also can't clean up 
> executorIdToBlacklistStatus, so we can
> // eventually remove the executor after the timeout.  Despite not 
> clearing those structures
> // here, we don't expect they will grow too big since you won't get too 
> many executors on one
> // node, and the timeout will clear it up periodically in any case.
> executorIdToFailureList -= executorId
>   }
> {code}
> I think the comments should be:
> {code}
> // We intentionally do not clean up executors that are already blacklisted in
> // nodeToBlacklistedExecs, so that if 
> {spark.blacklist.application.maxFailedExecutorsPerNode} - 1 executor on the 
> same node gets blacklisted, we can
> // blacklist the entire node.
> {code}
> Reference from the design doc 
> https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit.
> When considering whether to add a node to the application-level blacklist, the 
> following rule applies:
> Nodes are placed into a blacklist for the entire application when the number 
> of blacklisted executors goes over 
> spark.blacklist.application.maxFailedExecutorsPerNode (default 2),
> and the existing comment only explains the behavior for the default value.
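
For context, a minimal sketch of the configuration that rule refers to (the values shown are illustrative only):

{code}
// A node is blacklisted for the whole application once this many executors
// on it have been blacklisted (default 2).
val conf = new org.apache.spark.SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
{code}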



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2017-03-21 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934864#comment-15934864
 ] 

Mohamed Baddar commented on SPARK-7129:
---

[~josephkb] [~sethah] [~meihuawu] [~mlnick] If no one is working on this, can 
I start working on it? I have some experience contributing starter 
tasks to Spark. If no one is working on it, I would love to start reading the 
design docs mentioned in the comments and start discussing next steps.


> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19261.
-
   Resolution: Fixed
 Assignee: Xin Wu
Fix Version/s: 2.2.0

> Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
> --
>
> Key: SPARK-19261
> URL: https://issues.apache.org/jira/browse/SPARK-19261
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: StanZhai
>Assignee: Xin Wu
> Fix For: 2.2.0
>
>
> We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which 
> was already supported in versions < 2.x.
> This is very useful for those who want to upgrade their Spark version to 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20041) Update docs for NaN handling in approxQuantile

2017-03-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20041.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.2.0

> Update docs for NaN handling in approxQuantile
> --
>
> Key: SPARK-20041
> URL: https://issues.apache.org/jira/browse/SPARK-20041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> {{approxQuantile}} in R and Python now supports multiple columns, and the current 
> note about NaN handling is out of date:
> {{Note that rows containing any null values will be removed before 
> calculation.}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736
 ] 

Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:22 PM:
--

[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? 
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can implement their own optimizer like Spark SQL 
external data source support. Since I found this is more aligned with the 
original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.
Thanks.


was (Author: yanboliang):
[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? 
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation like 
Spark SQL external data source support. Since I found this is more aligned with 
the original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.
Thanks.

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736
 ] 

Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:18 PM:
--

[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? 
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation like 
Spark SQL external data source support. Since I found this is more aligned with 
the original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.
Thanks.


was (Author: yanboliang):
[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? 
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation such 
as the data source support. Since I found this is more aligned with the 
original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.
Thanks.

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736
 ] 

Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:17 PM:
--

[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? Thanks.
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation such 
as the data source support. Since I found this is more aligned with the 
original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.



was (Author: yanboliang):
[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? Thanks.


> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736
 ] 

Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:17 PM:
--

[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? 
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation such 
as the data source support. Since I found this is more aligned with the 
original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.
Thanks.


was (Author: yanboliang):
[~sethah] Thanks for the design doc.
One quick question: In your design, if we set the parameters in optimizer, Do 
we still support setting these parameters in estimator again?
If yes, why we need to support two entrances for the same set of params? I saw 
you reply at the design doc, you propose to make the params in optimizer 
superior to the ones in estimator. Does it involves confusion for users and 
extra maintenance cost?
Does the grid search-based model selection in the current framework (such as 
CrossValidator) can still work well? Thanks.
I'm more prefer to keep these params in estimators, make the optimizer layer as 
an internal API, and users can register their own optimizer implementation such 
as the data source support. Since I found this is more aligned with the 
original [ML pipeline 
design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#]
 which stores params outside a pipeline component.


> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19998) BlockRDD block not found Exception add RDD id info

2017-03-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19998:
-

Assignee: jianran.tfh

> BlockRDD block not found Exception add RDD id info
> --
>
> Key: SPARK-19998
> URL: https://issues.apache.org/jira/browse/SPARK-19998
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Assignee: jianran.tfh
>Priority: Trivial
> Fix For: 2.2.0
>
>
> "java.lang.Exception: Could not compute split, block $blockId not found" 
> doesn't have the rdd id info,  the "BlockManager: Removing RDD $id" has only 
> the RDD id, so it couldn't find that the Exception's reason is the Removing; 
> so it's better  block not found Exception add RDD id info 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19998) BlockRDD block not found Exception add RDD id info

2017-03-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19998.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17334
[https://github.com/apache/spark/pull/17334]

> BlockRDD block not found Exception add RDD id info
> --
>
> Key: SPARK-19998
> URL: https://issues.apache.org/jira/browse/SPARK-19998
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Assignee: jianran.tfh
>Priority: Trivial
> Fix For: 2.2.0
>
>
> "java.lang.Exception: Could not compute split, block $blockId not found" 
> doesn't have the rdd id info,  the "BlockManager: Removing RDD $id" has only 
> the RDD id, so it couldn't find that the Exception's reason is the Removing; 
> so it's better  block not found Exception add RDD id info 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736
 ] 

Yanbo Liang commented on SPARK-17136:
-

[~sethah] Thanks for the design doc.
One quick question: in your design, if we set the parameters on the optimizer, do 
we still support setting these parameters on the estimator as well?
If yes, why do we need to support two entry points for the same set of params? I saw 
your reply in the design doc; you propose to make the params on the optimizer 
take precedence over the ones on the estimator. Doesn't that create confusion for 
users and extra maintenance cost?
Can the grid search-based model selection in the current framework (such as 
CrossValidator) still work well? Thanks.


> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-21 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934719#comment-15934719
 ] 

Jason White commented on SPARK-19950:
-

Without something that allows us to read using the nullability as it exists on disk, 
we end up doing:
df = spark.read.parquet(path)
return spark.createDataFrame(df.rdd, schema)

which is obviously not desirable. We would much rather rely on the schema as 
defined by the file format (Parquet in our case), or rely on a user-supplied 
schema. Preferably both.

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, the schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}
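
A quick way to observe the reported mismatch, continuing from the snippet above in 
spark-shell (the printed value is what the description reports, not a captured log):
{code}
// The user-supplied schema declares nullable = false for "id",
// but per the report this prints true after load().
println(dfRead.schema("id").nullable)
{code}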



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19949) unify bad record handling in CSV and JSON

2017-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934710#comment-15934710
 ] 

Apache Spark commented on SPARK-19949:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17377

> unify bad record handling in CSV and JSON
> -
>
> Key: SPARK-19949
> URL: https://issues.apache.org/jira/browse/SPARK-19949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-03-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-12664:
---

Assignee: Weichen Xu  (was: Yanbo Liang)

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 
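
A small spark-shell sketch of the limitation (the toy dataset below is made up for 
illustration; in 2.1 the transformed output carries only the predicted label, with 
no raw scores exposed):
{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors

// Tiny illustrative dataset: 2 features, 2 classes.
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(1.0, 1.0))
)).toDF("label", "features")

val model = new MultilayerPerceptronClassifier()
  .setLayers(Array(2, 4, 2))   // input size, hidden layer, number of classes
  .setMaxIter(50)
  .fit(data)

// Only label/features/prediction come back; there is no rawPrediction or
// probability column, which is what this issue asks to expose.
model.transform(data).show()
{code}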



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20041) Update docs for NaN handling in approxQuantile

2017-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20041:


Assignee: Apache Spark

> Update docs for NaN handling in approxQuantile
> --
>
> Key: SPARK-20041
> URL: https://issues.apache.org/jira/browse/SPARK-20041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> {{approxQuantile}} in R and Python now supports multi-column input, and the 
> current note about NaN handling is out of date:
> {{Note that rows containing any null values will be removed before 
> calculation.}}
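
For context, a spark-shell sketch of the multi-column form that the R and Python 
APIs now mirror (column names and data are made up; treating NaN/null per column 
rather than dropping whole rows is the assumed updated behavior the note should 
describe):
{code}
// Multi-column approxQuantile: one array of quantiles per requested column.
val df = Seq((1.0, 10.0), (2.0, 20.0), (Double.NaN, 30.0), (4.0, 40.0)).toDF("a", "b")
val medians = df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.01)
println(medians.map(_.mkString(",")).mkString(" | "))
{code}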



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20041) Update docs for NaN handling in approxQuantile

2017-03-21 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-20041:


 Summary: Update docs for NaN handling in approxQuantile
 Key: SPARK-20041
 URL: https://issues.apache.org/jira/browse/SPARK-20041
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR
Affects Versions: 2.2.0
Reporter: zhengruifeng
Priority: Trivial


{{approxQuantile}} in R and Python now supports multi-column input, and the 
current note about NaN handling is out of date:
{{Note that rows containing any null values will be removed before 
calculation.}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org