[jira] [Commented] (SPARK-20052) Some InputDStream needs closing processing after processing all batches when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935789#comment-15935789 ] Sasaki Toru commented on SPARK-20052: - My explanation was not clear, sorry. This ticket is related to SPARK-20050. JobGenerator#stop waits for all batches to finish after InputDStream#stop is called when graceful shutdown is enabled, but the Kafka 0.10 DirectStream should commit offsets after processing all batches. So I thought an additional step (what I called the "closing process") is needed after all batches are processed. > Some InputDStream needs closing processing after processing all batches when > graceful shutdown > -- > > Key: SPARK-20052 > URL: https://issues.apache.org/jira/browse/SPARK-20052 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > Some classes extending InputDStream need closing processing after processing > all batches when graceful shutdown is enabled. > (e.g. when using Kafka as the data source, processed offsets need to be > committed to the Kafka broker) > InputDStream has a method 'stop' to stop receiving data, but this method is > called before the last batches generated for graceful shutdown are processed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
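The shutdown ordering the comment describes (stop the input, drain the batches that were already generated, and only then run a final "closing" step such as an offset commit) can be sketched as a small self-contained Python simulation. This is not Spark code; all names here (`DirectStreamSim`, `on_all_batches_done`, `graceful_shutdown`) are hypothetical stand-ins for illustration only.

```python
from queue import Queue, Empty

class DirectStreamSim:
    """Toy stand-in for an InputDStream that tracks offsets (hypothetical)."""
    def __init__(self):
        self.committed = 0
        self.receiving = True

    def stop(self):
        # Stop receiving new data; no new batches are generated after this.
        self.receiving = False

    def on_all_batches_done(self, last_offset):
        # The proposed "closing process": e.g. commit offsets to the broker.
        self.committed = last_offset

def graceful_shutdown(stream, pending):
    # 1. Stop the input first, so no new batches are generated.
    stream.stop()
    # 2. Drain batches that were already generated before the stop.
    last = 0
    while True:
        try:
            batch = pending.get_nowait()
        except Empty:
            break
        last = batch  # "process" the batch
    # 3. Only now run the closing hook, after the last batch is processed.
    stream.on_all_batches_done(last)
    return last

pending = Queue()
for off in (1, 2, 3):
    pending.put(off)
s = DirectStreamSim()
graceful_shutdown(s, pending)
print(s.committed)  # 3
```

The point of the sketch is the ordering: if the commit happened inside `stop()` (step 1), offsets for the batches drained in step 2 would be lost, which is the gap the ticket describes.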
[jira] [Updated] (SPARK-19925) SparkR spark.getSparkFiles fails on executor
[ https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-19925: Fix Version/s: 2.1.1 > SparkR spark.getSparkFiles fails on executor > > > Key: SPARK-19925 > URL: https://issues.apache.org/jira/browse/SPARK-19925 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > Attachments: error-log > > > SparkR function {{spark.getSparkFiles}} fails when called on > executors. For example, the following R code will fail. (See error logs in > the attachment.) > {code} > spark.addFile("./README.md") > seq <- seq(from = 1, to = 10, length.out = 5) > train <- function(seq) { > path <- spark.getSparkFiles("README.md") > print(path) > } > spark.lapply(seq, train) > {code} > However, we can run successfully with the Scala API: > {code} > import org.apache.spark.SparkFiles > sc.addFile("./README.md") > sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first() > {code} > and also successfully with the Python API: > {code} > from pyspark import SparkFiles > sc.addFile("./README.md") > sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19925) SparkR spark.getSparkFiles fails on executor
[ https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-19925. - Resolution: Fixed > SparkR spark.getSparkFiles fails on executor > > > Key: SPARK-19925 > URL: https://issues.apache.org/jira/browse/SPARK-19925 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > Attachments: error-log > > > SparkR function {{spark.getSparkFiles}} fails when called on > executors. For example, the following R code will fail. (See error logs in > the attachment.) > {code} > spark.addFile("./README.md") > seq <- seq(from = 1, to = 10, length.out = 5) > train <- function(seq) { > path <- spark.getSparkFiles("README.md") > print(path) > } > spark.lapply(seq, train) > {code} > However, we can run successfully with the Scala API: > {code} > import org.apache.spark.SparkFiles > sc.addFile("./README.md") > sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first() > {code} > and also successfully with the Python API: > {code} > from pyspark import SparkFiles > sc.addFile("./README.md") > sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19925) SparkR spark.getSparkFiles fails on executor
[ https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-19925: Target Version/s: 2.2.0 Fix Version/s: 2.2.0 > SparkR spark.getSparkFiles fails on executor > > > Key: SPARK-19925 > URL: https://issues.apache.org/jira/browse/SPARK-19925 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > Fix For: 2.2.0 > > Attachments: error-log > > > SparkR function {{spark.getSparkFiles}} fails when called on > executors. For example, the following R code will fail. (See error logs in > the attachment.) > {code} > spark.addFile("./README.md") > seq <- seq(from = 1, to = 10, length.out = 5) > train <- function(seq) { > path <- spark.getSparkFiles("README.md") > print(path) > } > spark.lapply(seq, train) > {code} > However, we can run successfully with the Scala API: > {code} > import org.apache.spark.SparkFiles > sc.addFile("./README.md") > sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first() > {code} > and also successfully with the Python API: > {code} > from pyspark import SparkFiles > sc.addFile("./README.md") > sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19925) SparkR spark.getSparkFiles fails on executor
[ https://issues.apache.org/jira/browse/SPARK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-19925: --- Assignee: Yanbo Liang > SparkR spark.getSparkFiles fails on executor > > > Key: SPARK-19925 > URL: https://issues.apache.org/jira/browse/SPARK-19925 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > Fix For: 2.2.0 > > Attachments: error-log > > > SparkR function {{spark.getSparkFiles}} fails when called on > executors. For example, the following R code will fail. (See error logs in > the attachment.) > {code} > spark.addFile("./README.md") > seq <- seq(from = 1, to = 10, length.out = 5) > train <- function(seq) { > path <- spark.getSparkFiles("README.md") > print(path) > } > spark.lapply(seq, train) > {code} > However, we can run successfully with the Scala API: > {code} > import org.apache.spark.SparkFiles > sc.addFile("./README.md") > sc.parallelize(Seq(0)).map{ _ => SparkFiles.get("README.md")}.first() > {code} > and also successfully with the Python API: > {code} > from pyspark import SparkFiles > sc.addFile("./README.md") > sc.parallelize(range(1)).map(lambda x: SparkFiles.get("README.md")).first() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20052) Some InputDStream needs closing processing after processing all batches when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru updated SPARK-20052: Summary: Some InputDStream needs closing processing after processing all batches when graceful shutdown (was: Some InputDStream needs closing processing after all batches processed when graceful shutdown) > Some InputDStream needs closing processing after processing all batches when > graceful shutdown > -- > > Key: SPARK-20052 > URL: https://issues.apache.org/jira/browse/SPARK-20052 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > Some classes extending InputDStream need closing processing after processing > all batches when graceful shutdown is enabled. > (e.g. when using Kafka as the data source, processed offsets need to be > committed to the Kafka broker) > InputDStream has a method 'stop' to stop receiving data, but this method is > called before the last batches generated for graceful shutdown are processed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20052) Some InputDStream needs closing processing after all batches processed when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935768#comment-15935768 ] Sean Owen commented on SPARK-20052: --- What do you have in mind? I don't think stopping the stream makes all batches finish immediately. > Some InputDStream needs closing processing after all batches processed when > graceful shutdown > - > > Key: SPARK-20052 > URL: https://issues.apache.org/jira/browse/SPARK-20052 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > Some classes extending InputDStream need closing processing after processing > all batches when graceful shutdown is enabled. > (e.g. when using Kafka as the data source, processed offsets need to be > committed to the Kafka broker) > InputDStream has a method 'stop' to stop receiving data, but this method is > called before the last batches generated for graceful shutdown are processed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20030) Add Event Time based Timeout
[ https://issues.apache.org/jira/browse/SPARK-20030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20030. --- Resolution: Fixed Issue resolved by pull request 17361 [https://github.com/apache/spark/pull/17361] > Add Event Time based Timeout > > > Key: SPARK-20030 > URL: https://issues.apache.org/jira/browse/SPARK-20030 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13947) The error message from using an invalid table reference is not clear
[ https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13947: Priority: Minor (was: Major) > The error message from using an invalid table reference is not clear > > > Key: SPARK-13947 > URL: https://issues.apache.org/jira/browse/SPARK-13947 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wes McKinney >Priority: Minor > > {code} > import numpy as np > import pandas as pd > df = pd.DataFrame({'foo': np.random.randn(1000), >'bar': np.random.randn(1000)}) > df2 = pd.DataFrame({'foo': np.random.randn(1000), > 'bar': np.random.randn(1000)}) > sdf = sqlContext.createDataFrame(df) > sdf2 = sqlContext.createDataFrame(df2) > sdf[sdf2.foo > 0] > {code} > Produces this error message: > {code} > AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 > in operator !Filter (foo#91 > cast(0 as double));' > {code} > It may be possible to make it more clear what the user did wrong. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13947) PySpark DataFrames: The error message from using an invalid table reference is not clear
[ https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13947: Component/s: (was: PySpark) SQL > PySpark DataFrames: The error message from using an invalid table reference > is not clear > > > Key: SPARK-13947 > URL: https://issues.apache.org/jira/browse/SPARK-13947 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wes McKinney > > {code} > import numpy as np > import pandas as pd > df = pd.DataFrame({'foo': np.random.randn(1000), >'bar': np.random.randn(1000)}) > df2 = pd.DataFrame({'foo': np.random.randn(1000), > 'bar': np.random.randn(1000)}) > sdf = sqlContext.createDataFrame(df) > sdf2 = sqlContext.createDataFrame(df2) > sdf[sdf2.foo > 0] > {code} > Produces this error message: > {code} > AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 > in operator !Filter (foo#91 > cast(0 as double));' > {code} > It may be possible to make it more clear what the user did wrong. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13947) The error message from using an invalid table reference is not clear
[ https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13947: Summary: The error message from using an invalid table reference is not clear (was: PySpark DataFrames: The error message from using an invalid table reference is not clear) > The error message from using an invalid table reference is not clear > > > Key: SPARK-13947 > URL: https://issues.apache.org/jira/browse/SPARK-13947 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wes McKinney > > {code} > import numpy as np > import pandas as pd > df = pd.DataFrame({'foo': np.random.randn(1000), >'bar': np.random.randn(1000)}) > df2 = pd.DataFrame({'foo': np.random.randn(1000), > 'bar': np.random.randn(1000)}) > sdf = sqlContext.createDataFrame(df) > sdf2 = sqlContext.createDataFrame(df2) > sdf[sdf2.foo > 0] > {code} > Produces this error message: > {code} > AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 > in operator !Filter (foo#91 > cast(0 as double));' > {code} > It may be possible to make it more clear what the user did wrong. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20035) Spark 2.0.2 writes empty file if no record is in the dataset
[ https://issues.apache.org/jira/browse/SPARK-20035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935693#comment-15935693 ] Ryan Magnusson commented on SPARK-20035: I'd like to start looking into this if no one else is already. > Spark 2.0.2 writes empty file if no record is in the dataset > > > Key: SPARK-20035 > URL: https://issues.apache.org/jira/browse/SPARK-20035 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Linux/Windows >Reporter: Andrew > > When there is no record in a dataset, a call to write with spark-csv > creates an empty file (i.e. with no header line) > ``` > dataset.write().format("com.databricks.spark.csv").option("header", > "true").save("... file name here ..."); > or > dataset.write().option("header", "true").csv("... file name here ..."); > ``` > The same file then cannot be read back using the same format (i.e. spark-csv) > since it is empty, as below. The same call works if the dataset has at least > one record. > ``` > sqlCtx.read().format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load("... file name here ..."); > or > sparkSession.read().option("header", "true").option("inferSchema", > "true").csv("... file name here ..."); > ``` > This is not right; you should always be able to read back a file that you > wrote. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
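The round-trip property the reporter expects (a zero-record write is still readable because the header survives) can be illustrated with Python's standard csv module. This is only a sketch of the desired semantics, not Spark's CSV datasource.

```python
import csv
import io

def write_csv(rows, header):
    """Write rows to CSV text, always emitting the header line,
    even when there are zero rows."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(header)  # header goes out unconditionally
    w.writerows(rows)
    return buf.getvalue()

def read_csv(text):
    """Read CSV text back; a truly empty file has no schema to infer."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("cannot infer schema from an empty file")
    return rows[0], rows[1:]

# An empty dataset round-trips: the header line carries the schema.
header, data = read_csv(write_csv([], ["id", "name"]))
print(header, data)  # ['id', 'name'] []
```

If the writer skipped the header for empty input (Spark 2.0.2's behavior per the report), `read_csv` would hit the empty-file branch and fail, which is exactly the asymmetry the issue describes.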
[jira] [Comment Edited] (SPARK-3165) DecisionTree does not use sparsity in data
[ https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646 ] Facai Yan edited comment on SPARK-3165 at 3/22/17 1:57 AM: --- Do you mean that TreePoint.binnedFeatures is Array[Int], which doesn't use sparsity in the data? If so, these modifications are needed: 1. modify TreePoint.binnedFeatures to Vector. 2. modify the LearningNode.predictImpl method if needed. 3. modify the methods for bin-wise computation, such as binSeqOp, to accelerate computation. Please correct me if I misunderstand. I'd like to work on this if no one else has started it. was (Author: facai): Do you mean that: TreePoint.binnedFeatures is Array[int], which doesn't sparsity in data? So those modifications is need: 1. modify TreePoint.binnedFeatures to Vector. 2. modify LearningNode.predictImpl method if need. 3. modify the methods about Bin-wise computation, such as binSeqOp, to accelerate computation. Please correct me if misunderstand. I'd like to work on it. > DecisionTree does not use sparsity in data > -- > > Key: SPARK-3165 > URL: https://issues.apache.org/jira/browse/SPARK-3165 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: computation > DecisionTree should take advantage of sparse feature vectors. Aggregation > over training data could handle the empty/zero-valued data elements more > efficiently. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3165) DecisionTree does not use sparsity in data
[ https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646 ] Facai Yan commented on SPARK-3165: -- Do you mean that TreePoint.binnedFeatures is Array[Int], which doesn't use sparsity in the data? If so, these modifications are needed: 1. modify TreePoint.binnedFeatures to Vector. 2. modify the LearningNode.predictImpl method if needed. 3. modify the methods for bin-wise computation, such as binSeqOp, to accelerate computation. Please correct me if I misunderstand. I'd like to work on it. > DecisionTree does not use sparsity in data > -- > > Key: SPARK-3165 > URL: https://issues.apache.org/jira/browse/SPARK-3165 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: computation > DecisionTree should take advantage of sparse feature vectors. Aggregation > over training data could handle the empty/zero-valued data elements more > efficiently. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
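The gist of the improvement, aggregating over only the stored (nonzero) entries of a sparse feature vector instead of iterating every feature, can be sketched in plain Python. The bin-wise aggregation is simplified to a weighted sum here, and the function names are illustrative; this is not Spark's binSeqOp.

```python
def agg_dense(stats, features, label):
    """Dense path: touch every feature, including zeros."""
    for j, v in enumerate(features):
        stats[j] += v * label
    return stats

def agg_sparse(stats, indices, values, label):
    """Sparse path: touch only the stored nonzero entries,
    skipping zero-valued features entirely."""
    for j, v in zip(indices, values):
        stats[j] += v * label
    return stats

# The same point in dense and sparse representations.
dense = [0.0, 2.0, 0.0, 5.0]
sparse_idx, sparse_val = [1, 3], [2.0, 5.0]

a = agg_dense([0.0] * 4, dense, 1.0)
b = agg_sparse([0.0] * 4, sparse_idx, sparse_val, 1.0)
print(a == b)  # True, but the sparse path did 2 updates instead of 4
```

The saving scales with sparsity: for data with mostly zero features, the sparse path's work is proportional to the number of nonzeros rather than to the full feature dimension.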
[jira] [Resolved] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
[ https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20051. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17382 [https://github.com/apache/spark/pull/17382] > Fix StreamSuite.recover from v2.1 checkpoint failing with IOException > - > > Key: SPARK-20051 > URL: https://issues.apache.org/jira/browse/SPARK-20051 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > Fix For: 2.2.0 > > > There is a race condition between calling stop on a streaming query and > deleting directories in withTempDir that causes the test to fail; the fix is > to delete lazily using a JVM shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935629#comment-15935629 ] Xiao Li commented on SPARK-20009: - [~marmbrus] Does it sound OK to you? > Use user-friendly DDL formats for defining a schema in user-facing APIs > > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in > the DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in some existing APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20052) Some InputDStream needs closing processing after all batches processed when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru updated SPARK-20052: Description: Some class extend InputDStream needs closing processing after processing all batches when graceful shutdown enabled. (e.g. When using Kafka as data source, need to commit processed offsets to Kafka Broker) InputDStream has method 'stop' to stop receiving data, but this method will be called before processing last batches generated for graceful shutdown. was: Some class extend InputDStream needs closing processing after all batches processed when graceful shutdown enabled. (e.g. When using Kafka as data source, need to commit processed offsets to Kafka Broker) InputDStream has method 'stop' to stop receiving data, but this method will be called before processing last batches generated for graceful shutdown. > Some InputDStream needs closing processing after all batches processed when > graceful shutdown > - > > Key: SPARK-20052 > URL: https://issues.apache.org/jira/browse/SPARK-20052 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > Some class extend InputDStream needs closing processing after processing all > batches when graceful shutdown enabled. > (e.g. When using Kafka as data source, need to commit processed offsets to > Kafka Broker) > InputDStream has method 'stop' to stop receiving data, but this method will > be called before processing last batches generated for graceful shutdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935620#comment-15935620 ] Hyukjin Kwon commented on SPARK-20008: -- Thank you for your kind explanation. I think you have more insight into this issue than me. Could you fix this? > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns > 1 > --- > > Key: SPARK-20008 > URL: https://issues.apache.org/jira/browse/SPARK-20008 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 >Reporter: Ravindra Bajpai > > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields > 1 against an expected 0. > This was not the case with Spark 1.5.2. This is an API change from a usage > point of view and hence I consider this a bug. Maybe a boundary case, not > sure. > Workaround - for now I check that the counts != 0 before this operation. Not > good for performance, hence creating a JIRA to track it. > As Young Zhang explained in reply to my mail - > Starting from Spark 2, these kinds of operations are implemented as a left > anti join, instead of using RDD operations directly. > The same issue also occurs on sqlContext. > scala> spark.version > res25: String = 2.0.2 > spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) > == Physical Plan == > *HashAggregate(keys=[], functions=[], output=[]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[], output=[]) > +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false > :- Scan ExistingRDD[] > +- BroadcastExchange IdentityBroadcastMode > +- Scan ExistingRDD[] > This arguably indicates a bug. But my guess is that the logic of comparing > NULL = NULL (should it return true or false?) is causing this kind of > confusion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20054) [Mesos] Detectability for resource starvation
[ https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935605#comment-15935605 ] Michael Gummelt commented on SPARK-20054: - Sounds like this could be solved just by having some better logging? Something that indicates the driver is waiting for more registered executors? > [Mesos] Detectability for resource starvation > - > > Key: SPARK-20054 > URL: https://issues.apache.org/jira/browse/SPARK-20054 > Project: Spark > Issue Type: Improvement > Components: Mesos, Scheduler >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Kamal Gurala >Priority: Minor > > We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We > had a production issue recently wherein our Spark frameworks accepted > resources from the Mesos master, so executors were started and the Spark > driver was aware of them, but the driver didn't schedule any tasks and nothing > happened for a long time because the minimum registered resources threshold > was not met; the cluster is usually under-provisioned because not all the > jobs need to run at the same time. These held resources were never offered > back to the master for re-allocation, bringing the entire cluster to a halt > until we had to intervene manually. > We use DRF for Mesos and FIFO for Spark, and at any point in time there could > be 10-15 Spark frameworks running on Mesos on the under-provisioned cluster. > The ask is for better recoverability or detectability in a scenario where > individual Spark frameworks hold onto resources but never launch any tasks, > or to have these frameworks release these resources after a fixed amount of > time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935603#comment-15935603 ] Xiao Li commented on SPARK-20008: - In a traditional RDBMS, users are not allowed to create a table with zero columns. Thus, the existing solution did not cover this case. Do you want to fix it, [~hyukjin.kwon]? Or do you want me to fix it? > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns > 1 > --- > > Key: SPARK-20008 > URL: https://issues.apache.org/jira/browse/SPARK-20008 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 >Reporter: Ravindra Bajpai > > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields > 1 against an expected 0. > This was not the case with Spark 1.5.2. This is an API change from a usage > point of view and hence I consider this a bug. Maybe a boundary case, not > sure. > Workaround - for now I check that the counts != 0 before this operation. Not > good for performance, hence creating a JIRA to track it. > As Young Zhang explained in reply to my mail - > Starting from Spark 2, these kinds of operations are implemented as a left > anti join, instead of using RDD operations directly. > The same issue also occurs on sqlContext. > scala> spark.version > res25: String = 2.0.2 > spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) > == Physical Plan == > *HashAggregate(keys=[], functions=[], output=[]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[], output=[]) > +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false > :- Scan ExistingRDD[] > +- BroadcastExchange IdentityBroadcastMode > +- Scan ExistingRDD[] > This arguably indicates a bug. But my guess is that the logic of comparing > NULL = NULL (should it return true or false?) is causing this kind of > confusion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
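The semantics the reporter expects from except can be sketched in plain Python as a distinct left anti join: keep the distinct rows of the left relation that have no match on the right, so an empty relation minus an empty relation is empty (count 0). This illustrates the expected result only, not Spark's implementation or the zero-column corner case that triggers the bug.

```python
def except_distinct(left, right):
    """Semantics of Dataset.except: the distinct rows of `left` that
    do not appear in `right` (a left anti join over whole rows).
    Plain-Python sketch; rows are hashable tuples."""
    right_set = set(right)
    out, seen = [], set()
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

# Distinct rows of the left side absent from the right side survive.
print(except_distinct([(1,), (2,), (2,)], [(2,)]))  # [(1,)]
# Empty minus empty must be empty, i.e. count() == 0.
print(len(except_distinct([], [])))  # 0
```

Under these set-difference semantics there is no row that could make the result nonempty, which is why a count of 1 for two empty inputs reads as a bug.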
[jira] [Comment Edited] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935599#comment-15935599 ] Takeshi Yamamuro edited comment on SPARK-20009 at 3/22/17 1:02 AM: --- I meant we support both a json-format and a new DDL format in existing APIs, as you said. This is like: https://github.com/apache/spark/compare/master...maropu:UserDDLForSchema#diff-df78a74ef92d9b8fb4ac142ff9a62464R111 was (Author: maropu): I meant we support both a json-format and a new DDL format in existing APIs, as you said. > Use user-friendly DDL formats for defining a schema in user-facing APIs > > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in > the DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in some existing APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935599#comment-15935599 ] Takeshi Yamamuro commented on SPARK-20009: -- I meant we support both a json-format and a new DDL format in existing APIs, as you said. > Use user-friendly DDL formats for defining a schema in user-facing APIs > > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in > the DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in some existing APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
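The idea behind the ticket, accepting a user-friendly DDL-style schema string like "a INT, b STRING" instead of a JSON schema, can be sketched as a toy parser in plain Python. This is a hypothetical illustration only, not Spark's actual DDL parser (which sits behind the API added in SPARK-19830).

```python
def parse_ddl_schema(ddl):
    """Parse a simple 'name TYPE, name TYPE' schema string into a list of
    (name, type) pairs. Toy sketch: no nested or parameterized types."""
    fields = []
    for part in ddl.split(","):
        name, _, dtype = part.strip().partition(" ")
        if not name or not dtype:
            raise ValueError(f"bad field definition: {part!r}")
        fields.append((name, dtype.strip().upper()))
    return fields

print(parse_ddl_schema("a INT, b STRING"))  # [('a', 'INT'), ('b', 'STRING')]
```

The appeal over the JSON form is visible even in this sketch: the DDL string is what users already write in CREATE TABLE statements, so no separate schema-object construction is needed at call sites like from_json.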
subscribe to spark issues
[jira] [Resolved] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19919. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17256 [https://github.com/apache/spark/pull/17256] > Defer input path validation into DataSource in CSV datasource > - > > Key: SPARK-19919 > URL: https://issues.apache.org/jira/browse/SPARK-19919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > Currently, if other datasources fail to infer the schema, they return {{None}} > and this is then validated in {{DataSource}} as below: > {code} > scala> spark.read.json("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It > must be specified manually.; > {code} > {code} > scala> spark.read.orc("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It > must be specified manually.; > {code} > {code} > scala> spark.read.parquet("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > {code} > However, CSV checks this within the datasource implementation and throws > a different exception message, as below: > {code} > scala> spark.read.csv("emptydir") > java.lang.IllegalArgumentException: requirement failed: Cannot infer schema > from an empty set of files > {code} > We could remove this duplicated check and validate this in one place, in the > same way and with the same message. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19919: --- Assignee: Hyukjin Kwon > Defer input path validation into DataSource in CSV datasource > - > > Key: SPARK-19919 > URL: https://issues.apache.org/jira/browse/SPARK-19919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > Currently, if other datasources fail to infer the schema, they return {{None}} > and this is then validated in {{DataSource}} as below: > {code} > scala> spark.read.json("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It > must be specified manually.; > {code} > {code} > scala> spark.read.orc("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It > must be specified manually.; > {code} > {code} > scala> spark.read.parquet("emptydir") > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > {code} > However, CSV checks this within the datasource implementation and throws > a different exception message, as below: > {code} > scala> spark.read.csv("emptydir") > java.lang.IllegalArgumentException: requirement failed: Cannot infer schema > from an empty set of files > {code} > We could remove this duplicated check and validate this in one place, in the > same way and with the same message. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserve nulls.
[ https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19980: Fix Version/s: 2.1.1 > Basic Dataset transformation on POJOs does not preserves nulls. > --- > > Key: SPARK-19980 > URL: https://issues.apache.org/jira/browse/SPARK-19980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Assignee: Takeshi Yamamuro > Fix For: 2.1.1, 2.2.0 > > > Applying an identity map transformation on a statically typed Dataset with a > POJO produces an unexpected result. > Given POJOs: > {code} > public class Stuff implements Serializable { > private String name; > public void setName(String name) { this.name = name; } > public String getName() { return name; } > } > public class Outer implements Serializable { > private String name; > private Stuff stuff; > public void setName(String name) { this.name = name; } > public String getName() { return name; } > public void setStuff(Stuff stuff) { this.stuff = stuff; } > public Stuff getStuff() { return stuff; } > } > {code} > Produces the result: > {code} > scala> val encoder = Encoders.bean(classOf[Outer]) > encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, > stuff[0]: struct] > scala> val schema = encoder.schema > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(name,StringType,true), > StructField(stuff,StructType(StructField(name,StringType,true)),true)) > scala> schema.printTreeString > root > |-- name: string (nullable = true) > |-- stuff: struct (nullable = true) > ||-- name: string (nullable = true) > scala> val df = > spark.read.schema(schema).json("stuff.json").as[Outer](encoder) > df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: > struct] > scala> df.show() > ++-+ > |name|stuff| > ++-+ > | v1| null| > ++-+ > scala> df.map(x => x)(encoder).show() > ++--+ > |name| stuff| > ++--+ > | v1|[null]| > ++--+ > {code} > After 
identity transformation, `stuff` becomes an object with null values > inside it instead of staying null itself. > Doing the same with case classes preserves the nulls: > {code} > scala> case class ScalaStuff(name: String) > defined class ScalaStuff > scala> case class ScalaOuter(name: String, stuff: ScalaStuff) > defined class ScalaOuter > scala> val encoder2 = Encoders.product[ScalaOuter] > encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, > stuff[0]: struct] > scala> val schema2 = encoder2.schema > schema2: org.apache.spark.sql.types.StructType = > StructType(StructField(name,StringType,true), > StructField(stuff,StructType(StructField(name,StringType,true)),true)) > scala> schema2.printTreeString > root > |-- name: string (nullable = true) > |-- stuff: struct (nullable = true) > ||-- name: string (nullable = true) > scala> > scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter] > df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: > struct] > scala> df2.show() > ++-+ > |name|stuff| > ++-+ > | v1| null| > ++-+ > scala> df2.map(x => x).show() > ++-+ > |name|stuff| > ++-+ > | v1| null| > ++-+ > {code} > stuff.json: > {code} > {"name":"v1", "stuff":null } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20054) [Mesos] Detectability for resource starvation
Kamal Gurala created SPARK-20054: Summary: [Mesos] Detectability for resource starvation Key: SPARK-20054 URL: https://issues.apache.org/jira/browse/SPARK-20054 Project: Spark Issue Type: Improvement Components: Mesos, Scheduler Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: Kamal Gurala Priority: Minor We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We recently had a production issue wherein our Spark frameworks accepted resources from the Mesos master, so executors were started and the Spark driver was aware of them, but the driver did not schedule any tasks and nothing happened for a long time because the minimum registered resources threshold was not met. The cluster is usually under-provisioned because not all the jobs need to run at the same time. The held resources were never offered back to the master for re-allocation, bringing the entire cluster to a halt until we intervened manually. We use DRF for Mesos and FIFO for Spark, and at any point in time there could be 10-15 Spark frameworks running on Mesos on the under-provisioned cluster. The ask is for better recoverability or detectability in a scenario where individual Spark frameworks hold onto resources but never launch any tasks, or to have these frameworks release those resources after a fixed amount of time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
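The "minimum registered resources threshold" the reporter mentions is controlled by Spark scheduler configuration. A sketch of the relevant settings (the values below are illustrative, not the defaults; check the Spark configuration documentation):

```
spark.scheduler.minRegisteredResourcesRatio        0.8
spark.scheduler.maxRegisteredResourcesWaitingTime  30s
```

The waiting-time setting caps how long the driver blocks waiting for the ratio to be met, which is one existing lever against the indefinite-hold scenario described above, though it does not release already-accepted resources.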
[jira] [Commented] (SPARK-20053) Can't select col when there is a dot (.) in the col name
[ https://issues.apache.org/jira/browse/SPARK-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935568#comment-15935568 ] Xuxiang Mao commented on SPARK-20053: - This is what my code looks like: String cmdOutputFile = "/Downloads/output.csv"; SparkSession spark = SparkSession .builder().master("local[*]") .appName("PostProcessingBeta") .getOrCreate(); Dataset<Row> df = spark.read().option("maxCharsPerColumn", "4096").option("inferSchema", true).option("header", true).option("comment", "#").csv(cmdOutputFile); df.select("sd_1_2").show(); // this can successfully return the result; no "." in the column name. df.select("r_2_shape_1.8").show(); // this will throw the exception > Can't select col when there is a dot (.) in the col name > - > > Key: SPARK-20053 > URL: https://issues.apache.org/jira/browse/SPARK-20053 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: Mac OS X >Reporter: Xuxiang Mao > > I use the Java API to read a csv file as a Dataframe and try to do > Dataframe.select("column name").show(). This operation succeeds > when the column name contains no ".", but it fails when the column name > has ".".
ERROR: > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve '`r_2_shape_1.8`' given input columns: [z_2.1.1, z_2.1.11, > r_1.34.2, r_1.14.2, r_2_shape_1.8, z_1.2.39];; > 'Project ['r_2_shape_1.8] > +- TypedFilter > com.amazon.recommerce.pricing.forecasting.postProcessing.utils.RawFileUtils$1@a03529c, > interface org.apache.spark.sql.Row, [StructField(lp__,IntegerType,true), > StructField(b.1,DoubleType,true), > StructField(temp_Intercept,DoubleType,true), > StructField(b_shape.1,DoubleType,true), StructField(sd_1.1,DoubleType,true), > StructField(sd_1_2,DoubleType,true), StructField(z_1.1.1,DoubleType,true), > StructField(z_1.2.1,DoubleType,true), StructField(z_1.1.2,DoubleType,true), > StructField(z_1.2.2,DoubleType,true), StructField(z_1.1.3,DoubleType,true), > StructField(z_1.2.3,DoubleType,true), StructField(z_1.1.4,DoubleType,true), > StructField(z_1.2.4,DoubleType,true), StructField(z_1.1.5,DoubleType,true), > StructField(z_1.2.5,DoubleType,true), StructField(z_1.1.6,DoubleType,true), > StructField(z_1.2.6,DoubleType,true), StructField(z_1.1.7,DoubleType,true), > StructField(z_1.2.7,DoubleType,true), StructField(z_1.1.8,DoubleType,true), > StructField(z_1.2.8,DoubleType,true), StructField(z_1.1.9,DoubleType,true), > StructField(z_1.2.9,DoubleType,true), ... 294 more fields], > createexternalrow(lp__#0, b.1#1, temp_Intercept#2, b_shape.1#3, sd_1.1#4, > sd_1_2#5, z_1.1.1#6, z_1.2.1#7, z_1.1.2#8, z_1.2.2#9, z_1.1.3#10, z_1.2.3#11, > z_1.1.4#12, z_1.2.4#13, z_1.1.5#14, z_1.2.5#15, z_1.1.6#16, z_1.2.6#17, > z_1.1.7#18, z_1.2.7#19, z_1.1.8#20, z_1.2.8#21, z_1.1.9#22, z_1.2.9#23, ... > 612 more fields) >+- > Relation[lp__#0,b.1#1,temp_Intercept#2,b_shape.1#3,sd_1.1#4,sd_1_2#5,z_1.1.1#6,z_1.2.1#7,z_1.1.2#8,z_1.2.2#9,z_1.1.3#10,z_1.2.3#11,z_1.1.4#12,z_1.2.4#13,z_1.1.5#14,z_1.2.5#15,z_1.1.6#16,z_1.2.6#17,z_1.1.7#18,z_1.2.7#19,z_1.1.8#20,z_1.2.8#21,z_1.1.9#22,z_1.2.9#23,... 
> 294 more fields] csv > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at
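A commonly used workaround for the failure above (not taken from the ticket itself) is to quote a column name containing dots in backticks, so Spark treats it as a single identifier rather than struct-field access, i.e. df.select("`r_2_shape_1.8`") instead of df.select("r_2_shape_1.8"). A minimal helper sketching the quoting rule (the `quote_col` name is hypothetical):

```python
def quote_col(name: str) -> str:
    # Spark parses an unquoted dot as struct-field access (a.b);
    # wrapping the whole name in backticks makes it one identifier.
    if "." in name and not (name.startswith("`") and name.endswith("`")):
        return "`" + name + "`"
    return name

print(quote_col("r_2_shape_1.8"))  # `r_2_shape_1.8`
print(quote_col("sd_1_2"))         # sd_1_2
```

With the quoted form, the select that failed in the stack trace above resolves the column directly.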
[jira] [Created] (SPARK-20053) Can't select col when there is a dot (.) in the col name
Xuxiang Mao created SPARK-20053: --- Summary: Can't select col when the dot (.) in col name Key: SPARK-20053 URL: https://issues.apache.org/jira/browse/SPARK-20053 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.0 Environment: mac OX Reporter: Xuxiang Mao I use java API read a csv file as Dataframe and try to do Dataframe.select("column name").show(). This operation can successfully done when the column name contains no ".", but it will fail when the column name has ".". ERROR: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`r_2_shape_1.8`' given input columns: [z_2.1.1, z_2.1.11, r_1.34.2, r_1.14.2, r_2_shape_1.8, z_1.2.39];; 'Project ['r_2_shape_1.8] +- TypedFilter com.amazon.recommerce.pricing.forecasting.postProcessing.utils.RawFileUtils$1@a03529c, interface org.apache.spark.sql.Row, [StructField(lp__,IntegerType,true), StructField(b.1,DoubleType,true), StructField(temp_Intercept,DoubleType,true), StructField(b_shape.1,DoubleType,true), StructField(sd_1.1,DoubleType,true), StructField(sd_1_2,DoubleType,true), StructField(z_1.1.1,DoubleType,true), StructField(z_1.2.1,DoubleType,true), StructField(z_1.1.2,DoubleType,true), StructField(z_1.2.2,DoubleType,true), StructField(z_1.1.3,DoubleType,true), StructField(z_1.2.3,DoubleType,true), StructField(z_1.1.4,DoubleType,true), StructField(z_1.2.4,DoubleType,true), StructField(z_1.1.5,DoubleType,true), StructField(z_1.2.5,DoubleType,true), StructField(z_1.1.6,DoubleType,true), StructField(z_1.2.6,DoubleType,true), StructField(z_1.1.7,DoubleType,true), StructField(z_1.2.7,DoubleType,true), StructField(z_1.1.8,DoubleType,true), StructField(z_1.2.8,DoubleType,true), StructField(z_1.1.9,DoubleType,true), StructField(z_1.2.9,DoubleType,true), ... 
294 more fields], createexternalrow(lp__#0, b.1#1, temp_Intercept#2, b_shape.1#3, sd_1.1#4, sd_1_2#5, z_1.1.1#6, z_1.2.1#7, z_1.1.2#8, z_1.2.2#9, z_1.1.3#10, z_1.2.3#11, z_1.1.4#12, z_1.2.4#13, z_1.1.5#14, z_1.2.5#15, z_1.1.6#16, z_1.2.6#17, z_1.1.7#18, z_1.2.7#19, z_1.1.8#20, z_1.2.8#21, z_1.1.9#22, z_1.2.9#23, ... 612 more fields) +- Relation[lp__#0,b.1#1,temp_Intercept#2,b_shape.1#3,sd_1.1#4,sd_1_2#5,z_1.1.1#6,z_1.2.1#7,z_1.1.2#8,z_1.2.2#9,z_1.1.3#10,z_1.2.3#11,z_1.1.4#12,z_1.2.4#13,z_1.1.5#14,z_1.2.5#15,z_1.1.6#16,z_1.2.6#17,z_1.1.7#18,z_1.2.7#19,z_1.1.8#20,z_1.2.8#21,z_1.1.9#22,z_1.2.9#23,... 294 more fields] csv at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at
[jira] [Updated] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
[ https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-20051: - Description: There is a race condition between calling stop on a streaming query and deleting directories in withTempDir that causes the test to fail; the fix is to delete lazily using a JVM shutdown hook. (was: There is a race condition with deleting directories in withTempDir that causes test to fail, fixing to do lazy deletion using delete on shutdown JVM hook.) > Fix StreamSuite.recover from v2.1 checkpoint failing with IOException > - > > Key: SPARK-20051 > URL: https://issues.apache.org/jira/browse/SPARK-20051 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > > There is a race condition between calling stop on a streaming query and > deleting directories in withTempDir that causes the test to fail; the fix is to > delete lazily using a JVM shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
[ https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20051: Assignee: (was: Apache Spark) > Fix StreamSuite.recover from v2.1 checkpoint failing with IOException > - > > Key: SPARK-20051 > URL: https://issues.apache.org/jira/browse/SPARK-20051 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > > There is a race condition with deleting directories in withTempDir that > causes test to fail, fixing to do lazy deletion using delete on shutdown JVM > hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
[ https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20051: Assignee: Apache Spark > Fix StreamSuite.recover from v2.1 checkpoint failing with IOException > - > > Key: SPARK-20051 > URL: https://issues.apache.org/jira/browse/SPARK-20051 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar >Assignee: Apache Spark > > There is a race condition with deleting directories in withTempDir that > causes test to fail, fixing to do lazy deletion using delete on shutdown JVM > hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
[ https://issues.apache.org/jira/browse/SPARK-20051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935564#comment-15935564 ] Apache Spark commented on SPARK-20051: -- User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/17382 > Fix StreamSuite.recover from v2.1 checkpoint failing with IOException > - > > Key: SPARK-20051 > URL: https://issues.apache.org/jira/browse/SPARK-20051 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > > There is a race condition with deleting directories in withTempDir that > causes test to fail, fixing to do lazy deletion using delete on shutdown JVM > hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20052) Some InputDStreams need closing processing after all batches are processed during graceful shutdown
Sasaki Toru created SPARK-20052: --- Summary: Some InputDStreams need closing processing after all batches are processed during graceful shutdown Key: SPARK-20052 URL: https://issues.apache.org/jira/browse/SPARK-20052 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.2.0 Reporter: Sasaki Toru Some classes extending InputDStream need closing processing after all batches are processed when graceful shutdown is enabled (e.g., when using Kafka as the data source, processed offsets need to be committed to the Kafka broker). InputDStream has a 'stop' method to stop receiving data, but this method is called before the last batches generated for graceful shutdown are processed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
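The shutdown ordering the ticket asks for can be sketched abstractly (a hypothetical toy class, not Spark's actual InputDStream API): stop ingestion first, then drain every batch already generated, and only then run a closing step such as committing offsets:

```python
class GracefulStream:
    # Hypothetical sketch of the requested shutdown ordering;
    # not Spark's actual API.
    def __init__(self, batches):
        self.pending = list(batches)
        self.processed = []
        self.offsets_committed = False

    def shutdown_gracefully(self):
        # 1. stop receiving new data (what InputDStream.stop does today)
        # 2. process every batch that was already generated
        while self.pending:
            self.processed.append(self.pending.pop(0))
        # 3. the missing "closing processing" hook: only now is it safe
        #    to commit offsets covering everything that was processed
        self.offsets_committed = True

s = GracefulStream(["batch-1", "batch-2"])
s.shutdown_gracefully()
print(s.processed, s.offsets_committed)
```

The point of the ticket is that step 3 has no dedicated hook: stop() fires at step 1, before the last batches are drained, which is too early for a Kafka offset commit.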
[jira] [Created] (SPARK-20051) Fix StreamSuite.recover from v2.1 checkpoint failing with IOException
Kunal Khamar created SPARK-20051: Summary: Fix StreamSuite.recover from v2.1 checkpoint failing with IOException Key: SPARK-20051 URL: https://issues.apache.org/jira/browse/SPARK-20051 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Kunal Khamar There is a race condition with deleting directories in withTempDir that causes the test to fail; the fix is to delete lazily using a JVM shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
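The fix direction described above (delete on shutdown instead of deleting immediately) is a JVM-side change; as a language-neutral illustration, here is the same idea with Python's atexit playing the role of a JVM shutdown hook:

```python
import atexit
import os
import shutil
import tempfile

def mk_temp_dir_deleted_on_exit() -> str:
    # Instead of deleting the directory right away (which can race with
    # a streaming query that is still stopping), register the deletion
    # to run at interpreter shutdown -- analogous to a JVM shutdown hook.
    d = tempfile.mkdtemp()
    atexit.register(shutil.rmtree, d, ignore_errors=True)
    return d

d = mk_temp_dir_deleted_on_exit()
print(os.path.isdir(d))  # True until the process exits
```

Deferring the deletion means the still-stopping query can never observe a half-deleted checkpoint directory, which is what produced the IOException in the test.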
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935540#comment-15935540 ] Hyukjin Kwon commented on SPARK-20008: -- [~smilegator], it seems the discussion is about duplicates in the result, if I understood correctly. The problem here is that {{Set() - Set()}} should return an empty {{Set()}}, which was previously the case. However, it now seems to return {{Set(Row())}} from empty dataframes. In the current master, {code} scala> spark.emptyDataFrame.except(spark.emptyDataFrame).collect() res0: Array[org.apache.spark.sql.Row] = Array([]) scala> spark.emptyDataFrame.collect() res1: Array[org.apache.spark.sql.Row] = Array() {code} I thought S∖S=∅ as below: {code} scala> spark.range(1).except(spark.range(1)).collect() res14: Array[Long] = Array() {code} > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns > 1 > --- > > Key: SPARK-20008 > URL: https://issues.apache.org/jira/browse/SPARK-20008 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 >Reporter: Ravindra Bajpai > > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields > 1 against an expected 0. > This was not the case with Spark 1.5.2. This is an API change from a usage > point of view and hence I consider this a bug. It may be a boundary case, not > sure. > Workaround - for now I check counts != 0 before this operation. Not good > for performance, hence creating a JIRA to track it. > As Young Zhang explained in reply to my mail - > Starting from Spark 2, these kinds of operations are implemented as a left anti > join, instead of using RDD operations directly. > Same issue also on sqlContext.
> scala> spark.version > res25: String = 2.0.2 > spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) > == Physical Plan == > *HashAggregate(keys=[], functions=[], output=[]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[], output=[]) > +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false > :- Scan ExistingRDD[] > +- BroadcastExchange IdentityBroadcastMode > +- Scan ExistingRDD[] > This arguably means a bug. But my guess is liking the logic of comparing NULL > = NULL, should it return true or false, causing this kind of confusion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
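The semantics the comment appeals to are ordinary set difference: for any set S, S∖S=∅, and in particular ∅∖∅=∅, so an empty DataFrame except itself should have count 0, not 1. In plain Python terms:

```python
# For any set S, S - S is empty; an empty "DataFrame" minus itself
# should likewise have count 0, not 1.
assert set() - set() == set()
assert {1, 2} - {1, 2} == set()
assert len(set() - set()) == 0  # the expected .count() result
print("set difference semantics hold")
```

The physical plan quoted above explains the discrepancy: the left-anti join's NULL = NULL comparison fails to match the single all-NULL "row" of an empty schema against itself, so it survives into the result.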
[jira] [Updated] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20047: Description: For certain applications, such as stacked regressions, it is important to put non-negative constraints on the regression coefficients. Also, if the ranges of coefficients are known, it makes sense to constrain the coefficient search space. Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a set of m linear constraints on the coefficients is very challenging as discussed in many literatures. However, for box constraints on the coefficients, the optimization is well solved. For gradient descent, people can projected gradient descent in the primal by zeroing the negative weights at each step. For LBFGS, an extended version of it, LBFGS-B can handle large scale box optimization efficiently. Unfortunately, for OWLQN, there is no good efficient way to do optimization with box constrains. As a result, in this work, we only implement constrained LR with box constrains without L1 regularization. Note that since we standardize the data in training phase, so the coefficients seen in the optimization subroutine are in the scaled space; as a result, we need to convert the box constrains into scaled space. Users will be able to set the lower / upper bounds of each coefficients and intercepts. was: For certain applications, such as stacked regressions, it is important to put non-negative constraints on the regression coefficients. Also, if the ranges of coefficients are known, it makes sense to constrain the coefficient search space. Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a set of m linear constraints on the coefficients is very challenging as discussed in many literatures. 
However, for box constraints on the coefficients, the optimization is well solved. For gradient descent, people can projected gradient descent in the primal by zeroing the negative weights at each step. For LBFGS, an extended version of it, LBFGS-B can handle large scale box optimization efficiently. Unfortunately, for OWLQN, there is no good efficient way to do optimization with box constrains. As a result, in this work, we only implement constrained LR with box constrains without L1 regularization. Note that since we standardize the data in training phase, so the coefficients seen in the optimization subroutine are in the scaled space; as a result, we need to convert the box constrains into scaled space. Users will be able to set the lower / upper bounds of each coefficients and intercepts. One solution could be to modify these implementations and do a Projected Gradient Descent in the primal by zeroing the negative weights at each step. But this process is inconvenient because the nice convergence properties are then lost. > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models object to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors which places a > set of m linear constraints on the coefficients is very challenging as > discussed in many literatures. > However, for box constraints on the coefficients, the optimization is well > solved. 
For gradient descent, one can apply projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail:
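The projected-gradient idea mentioned in the description (generalized here to arbitrary box constraints, not just non-negativity) can be sketched in a few lines: take a gradient step, then clip each coordinate back into its [lower, upper] box. This is an illustrative toy on a quadratic, not the LBFGS-B route the ticket actually proposes:

```python
def project(x, lo, hi):
    # Project each coordinate back into its box [lo_i, hi_i].
    return [min(max(v, l), h) for v, l, h in zip(x, lo, hi)]

def projected_gradient_descent(grad, x0, lo, hi, lr=0.1, steps=200):
    x = project(list(x0), lo, hi)
    for _ in range(steps):
        g = grad(x)
        x = project([v - lr * gi for v, gi in zip(x, g)], lo, hi)
    return x

# Toy problem: minimize (x - 2)^2 subject to 0 <= x <= 1.
# The unconstrained optimum is x = 2, so PGD should stop at the bound x = 1.
sol = projected_gradient_descent(lambda x: [2 * (x[0] - 2)], [0.5], [0.0], [1.0])
print(sol)  # [1.0]
```

The description's caveat holds for this scheme too: it is simple but loses the convergence guarantees of quasi-Newton methods, which is why LBFGS-B is the preferred tool for large-scale box-constrained fitting.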
[jira] [Updated] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20047: Description: For certain applications, such as stacked regressions, it is important to put non-negative constraints on the regression coefficients. Also, if the ranges of the coefficients are known, it makes sense to constrain the coefficient search space. Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ R^{m×p} and b ∈ R^{m} are predefined matrices and vectors that place a set of m linear constraints on the coefficients, is very challenging, as widely discussed in the literature. However, for box constraints on the coefficients, the optimization is well solved. For gradient descent, one can apply projected gradient descent in the primal by zeroing the negative weights at each step. For LBFGS, an extended version, LBFGS-B, can handle large-scale box-constrained optimization efficiently. Unfortunately, for OWLQN, there is no efficient way to optimize with box constraints. As a result, in this work we only implement constrained LR with box constraints and without L1 regularization. Note that since we standardize the data in the training phase, the coefficients seen in the optimization subroutine are in the scaled space; as a result, we need to convert the box constraints into the scaled space. Users will be able to set the lower/upper bounds of each coefficient and intercept. One solution could be to modify these implementations and do projected gradient descent in the primal by zeroing the negative weights at each step, but this process is inconvenient because the nice convergence properties are then lost. 
> Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient > search space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^{m×p} and b ∈ R^{m} are predefined matrices and vectors that place a set of > m linear constraints on the coefficients, is very challenging, as widely > discussed in the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can apply projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version, LBFGS-B, can handle large-scale box-constrained optimization > efficiently. Unfortunately, for OWLQN, there is no efficient way to optimize > with box constraints. > As a result, in this work we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as a > result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower/upper bounds of each coefficient and > intercept. > > One solution could be to modify these implementations and do projected > gradient descent in the primal by zeroing the negative weights at each step, > but this process is inconvenient because the nice convergence properties are > then lost. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
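The box-constrained optimization this ticket describes can be sketched outside Spark. The toy example below (plain NumPy, not MLlib's actual implementation; all names and data are illustrative) runs projected gradient descent for logistic regression, clipping the coefficients into user-supplied lower/upper bounds after every step:

```python
import numpy as np

def fit_box_constrained_lr(X, y, lower, upper, lr=0.1, steps=500):
    """Projected gradient descent for logistic regression with box
    constraints: after each gradient step, project the coefficients
    back into [lower, upper] element-wise."""
    w = np.clip(np.zeros(X.shape[1]), lower, upper)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        grad = X.T @ (p - y) / len(y)             # logistic-loss gradient
        w = np.clip(w - lr * grad, lower, upper)  # projection step
    return w

# Toy data: the second feature has a *negative* true coefficient.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200) > 0).astype(float)

# A non-negativity lower bound holds the second coefficient at (or near)
# the 0.0 boundary instead of its unconstrained negative optimum.
w = fit_box_constrained_lr(X, y, lower=np.array([0.0, 0.0]),
                           upper=np.array([np.inf, np.inf]))
print(w)
```

With a non-negativity bound, a coefficient whose unconstrained optimum is negative ends up held at the boundary, which is exactly the "zeroing the negative weights" projection the description mentions, generalized to arbitrary box constraints.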
[jira] [Updated] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru updated SPARK-20050: Description: I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below: {code} val kafkaStream = KafkaUtils.createDirectStream[String, String](...) kafkaStream.map { input => "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString }.foreachRDD { rdd => rdd.foreach { input => println(input) } } kafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) } {code} Some records processed in the last batch before a Streaming graceful shutdown are reprocessed in the first batch after Spark Streaming restarts. This may be because the offsets specified in commitAsync are committed at the head of the next batch. was: I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below: {code} val kafkaStream = KafkaUtils.createDirectStream[String, String](...) kafkaStream.map { input => "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString }.foreachRDD { rdd => rdd.foreach { input => println(input) } } kafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) } {code} Some records processed in the last batch before a Streaming graceful shutdown are reprocessed in the first batch after Spark Streaming restarts. This may be because the offsets specified in commitAsync are committed at the head of the next batch. 
Issue Type: Bug (was: Improvement) > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below: > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records processed in the last batch before a Streaming graceful shutdown > are reprocessed in the first batch after Spark Streaming restarts. > This may be because the offsets specified in commitAsync are committed at the > head of the next batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
Sasaki Toru created SPARK-20050: --- Summary: Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown Key: SPARK-20050 URL: https://issues.apache.org/jira/browse/SPARK-20050 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.2.0 Reporter: Sasaki Toru I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below: {code} val kafkaStream = KafkaUtils.createDirectStream[String, String](...) kafkaStream.map { input => "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString }.foreachRDD { rdd => rdd.foreach { input => println(input) } } kafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) } {code} Some records processed in the last batch before a Streaming graceful shutdown are reprocessed in the first batch after Spark Streaming restarts. This may be because the offsets specified in commitAsync are committed at the head of the next batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
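The offset-replay behaviour reported here is easy to see with a toy scheduler. The sketch below (a hypothetical pure-Python model, not Spark code) commits the previous batch's offset range at the head of the next batch, as the ticket describes; a graceful shutdown after the final batch then leaves that batch's offsets uncommitted, so a restart replays its records:

```python
def run_batches(batches, committed):
    """Process each (start, end) offset-range batch; mimic a driver that
    commits the *previous* batch's offsets at the head of the next batch.
    Returns the offset a restarted job would resume from."""
    pending = None
    for start, end in batches:
        if pending is not None:
            committed.append(pending)   # commit happens one batch late
        # ... process records in [start, end) here ...
        pending = (start, end)
    # graceful shutdown: the last batch was processed, but its commit never ran
    return committed[-1][1] if committed else 0

committed = []
resume_at = run_batches([(0, 10), (10, 20), (20, 30)], committed)
print(resume_at)  # 20 -> records 20..29 are reprocessed after restart
```

Only (0, 10) and (10, 20) ever reach `committed`; the (20, 30) range is lost, which is why a "closing" commit after the final batch (SPARK-20052) is needed.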
[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table
[ https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20023: Assignee: Apache Spark (was: Xiao Li) > Can not see table comment when describe formatted table > --- > > Key: SPARK-20023 > URL: https://issues.apache.org/jira/browse/SPARK-20023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: chenerlu >Assignee: Apache Spark > > Spark 2.x implements create table by itself. > https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7 > But in the implementation mentioned above, the table comment is removed from > the properties, so the user cannot see the table comment by running "describe > formatted table". Similarly, when a user alters the table comment, they still > cannot see the change by running "describe formatted table". > I wonder why we removed table comments; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table
[ https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20023: Assignee: Xiao Li (was: Apache Spark) > Can not see table comment when describe formatted table > --- > > Key: SPARK-20023 > URL: https://issues.apache.org/jira/browse/SPARK-20023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: chenerlu >Assignee: Xiao Li > > Spark 2.x implements create table by itself. > https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7 > But in the implementation mentioned above, the table comment is removed from > the properties, so the user cannot see the table comment by running "describe > formatted table". Similarly, when a user alters the table comment, they still > cannot see the change by running "describe formatted table". > I wonder why we removed table comments; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table
[ https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935507#comment-15935507 ] Apache Spark commented on SPARK-20023: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17381 > Can not see table comment when describe formatted table > --- > > Key: SPARK-20023 > URL: https://issues.apache.org/jira/browse/SPARK-20023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: chenerlu >Assignee: Xiao Li > > Spark 2.x implements create table by itself. > https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7 > But in the implementation mentioned above, the table comment is removed from > the properties, so the user cannot see the table comment by running "describe > formatted table". Similarly, when a user alters the table comment, they still > cannot see the change by running "describe formatted table". > I wonder why we removed table comments; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Hu updated SPARK-19408: --- Target Version/s: 2.3.0 (was: 2.2.0) > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, or >=. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20004) Spark thrift server ovewrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935472#comment-15935472 ] Bo Meng edited comment on SPARK-20004 at 3/21/17 10:32 PM: --- I think you can still use --name for your app name. for example, /spark/sbin/start-thriftserver.sh --name="My server 1" was (Author: bomeng): I think you can still use --name for your app name. for example, /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host --conf spark.app.name="ODBC server $host" > Spark thrift server ovewrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still name in yarn is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20004) Spark thrift server ovewrites spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-20004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935472#comment-15935472 ] Bo Meng commented on SPARK-20004: - I think you can still use --name for your app name. for example, /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host --conf spark.app.name="ODBC server $host" > Spark thrift server ovewrites spark.app.name > > > Key: SPARK-20004 > URL: https://issues.apache.org/jira/browse/SPARK-20004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Egor Pahomov >Priority: Minor > > {code} > export SPARK_YARN_APP_NAME="ODBC server $host" > /spark/sbin/start-thriftserver.sh --conf spark.yarn.queue=spark.client.$host > --conf spark.app.name="ODBC server $host" > {code} > And spark-defaults.conf contains: > {code} > spark.app.name "ODBC server spark01" > {code} > Still name in yarn is "Thrift JDBC/ODBC Server" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes
Jakub Nowacki created SPARK-20049: - Summary: Writing data to Parquet with partitions takes very long after the job finishes Key: SPARK-20049 URL: https://issues.apache.org/jira/browse/SPARK-20049 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, SQL Affects Versions: 2.1.0 Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian GNU/Linux 8.7 (jessie) Reporter: Jakub Nowacki I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward, and the data set is really a sample from a larger Parquet data set; the job is run in PySpark on YARN and written to HDFS: {code} # there is a column 'date' in df df.write.partitionBy("date").parquet("dest_dir") {code} The reading part took as long as usual, but after the job had been marked as finished in PySpark and the UI, the Python interpreter was still showing it as busy. Indeed, when I checked the HDFS folder, I noticed that the files were still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} folders. First of all, it takes much longer than saving the same set without partitioning. Second, it is done in the background, without visible progress of any kind. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
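The slow tail described here is the output committer moving every task file from {{dest_dir/_temporary}} into its final {{date=*}} partition directory during job commit. The pure-Python sketch below (illustrative file names only, no Spark involved) mimics that final move step; on real clusters a commonly cited mitigation is the "version 2" FileOutputCommitter algorithm (e.g. setting {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}), which commits task output as tasks finish rather than all at once at job end:

```python
import os
import shutil
import tempfile

def commit_partitioned_output(dest):
    """Mimic the job-commit step: move every task file written under
    dest/_temporary into its final dest/date=<value>/ directory."""
    tmp = os.path.join(dest, "_temporary")
    for name in sorted(os.listdir(tmp)):       # e.g. 'date=2017-03-21_part-0'
        part, fname = name.split("_", 1)       # partition dir, file name
        part_dir = os.path.join(dest, part)
        os.makedirs(part_dir, exist_ok=True)
        shutil.move(os.path.join(tmp, name), os.path.join(part_dir, fname))
    shutil.rmtree(tmp)                         # drop the staging directory

# Stage some fake task output, then "commit" it.
dest = tempfile.mkdtemp()
os.makedirs(os.path.join(dest, "_temporary"))
for d in ("2017-03-20", "2017-03-21"):
    for i in range(2):
        open(os.path.join(dest, "_temporary", f"date={d}_part-{i}"), "w").close()

commit_partitioned_output(dest)
print(sorted(os.listdir(dest)))  # ['date=2017-03-20', 'date=2017-03-21']
```

With many partitions and files, this sequential move is exactly the invisible post-job work the reporter observed.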
[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong edited comment on SPARK-4296 at 3/21/17 10:01 PM: --- I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} was (Author: irinatruong): I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28" > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. 
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong edited comment on SPARK-4296 at 3/21/17 9:59 PM: -- I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} was (Author: irinatruong): I'm have the same exception with pyspark when my expression uses a compiled and registered Scala UDF: sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28" > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. 
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong commented on SPARK-4296: - I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF: sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY".
> {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28" > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. > Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
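The failing equality test above compares a grouping expression against the same expression wrapped in an Alias. The sketch below is hypothetical pure Python, not Catalyst's actual classes, illustrating the proposed "mechanism to compare Alias expression and non-Alias expression": strip aliases before comparing.

```python
# Hypothetical sketch of alias-insensitive expression comparison.
# Expr, Attr, and Alias here are invented stand-ins for Catalyst expression nodes.

class Expr:
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def strip_alias(self):
        """Recursively drop Alias wrappers: Alias(child, name) -> child."""
        if self.op == "Alias":
            return self.children[0].strip_alias()
        return Expr(self.op, *[c.strip_alias() if isinstance(c, Expr) else c
                               for c in self.children])

    def __eq__(self, other):
        return (isinstance(other, Expr) and self.op == other.op
                and self.children == other.children)

    def __hash__(self):
        return hash((self.op, self.children))

# GROUP BY expression: Upper(birthday.date)
group_expr = Expr("Upper", Expr("Attr", "birthday.date"))
# SELECT expression: Upper(birthday.date AS date#9) -- the attribute got aliased
select_expr = Expr("Upper", Expr("Alias", Expr("Attr", "birthday.date"), "date#9"))

print(group_expr == select_expr)                              # False: spurious mismatch
print(group_expr.strip_alias() == select_expr.strip_alias())  # True: semantically equal
```

Naive structural equality reports a mismatch and triggers the "Expression not in GROUP BY" error; comparing after stripping aliases recognizes the two expressions as the same.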
[jira] [Assigned] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit
[ https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-19237: - Assignee: Felix Cheung > SparkR package on Windows waiting for a long time when no java is found > launching spark-submit > -- > > Key: SPARK-19237 > URL: https://issues.apache.org/jira/browse/SPARK-19237 > Project: Spark > Issue Type: Bug > Components: Spark Core, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > When installing SparkR as an R package (install.packages) on Windows, it will > check for a Spark distribution and automatically download and cache it. But if > there is no Java runtime on the machine, spark-submit will just hang. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit
[ https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-19237. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16596 [https://github.com/apache/spark/pull/16596] > SparkR package on Windows waiting for a long time when no java is found > launching spark-submit > -- > > Key: SPARK-19237 > URL: https://issues.apache.org/jira/browse/SPARK-19237 > Project: Spark > Issue Type: Bug > Components: Spark Core, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > When installing SparkR as an R package (install.packages) on Windows, it will > check for a Spark distribution and automatically download and cache it. But if > there is no Java runtime on the machine, spark-submit will just hang. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20048) Cloning SessionState does not clone query execution listeners
[ https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20048: Assignee: Apache Spark > Cloning SessionState does not clone query execution listeners > - > > Key: SPARK-20048 > URL: https://issues.apache.org/jira/browse/SPARK-20048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20048) Cloning SessionState does not clone query execution listeners
[ https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935355#comment-15935355 ] Apache Spark commented on SPARK-20048: -- User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/17379 > Cloning SessionState does not clone query execution listeners > - > > Key: SPARK-20048 > URL: https://issues.apache.org/jira/browse/SPARK-20048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20048) Cloning SessionState does not clone query execution listeners
[ https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20048: Assignee: (was: Apache Spark) > Cloning SessionState does not clone query execution listeners > - > > Key: SPARK-20048 > URL: https://issues.apache.org/jira/browse/SPARK-20048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20048) Cloning SessionState does not clone query execution listeners
Kunal Khamar created SPARK-20048: Summary: Cloning SessionState does not clone query execution listeners Key: SPARK-20048 URL: https://issues.apache.org/jira/browse/SPARK-20048 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kunal Khamar -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20023) Can not see table comment when describe formatted table
[ https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20023: --- Assignee: Xiao Li > Can not see table comment when describe formatted table > --- > > Key: SPARK-20023 > URL: https://issues.apache.org/jira/browse/SPARK-20023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: chenerlu >Assignee: Xiao Li > > Spark 2.x implements create table by itself. > https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7 > But the implementation mentioned above removes the table comment from the > table properties, so users cannot see the table comment by running "describe formatted > table". Similarly, when a user alters the table comment, the change is still not > visible when running "describe formatted table". > I wonder why we removed table comments; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table
[ https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935320#comment-15935320 ] Xiao Li commented on SPARK-20023: - {{DESC EXTENDED}} works. Obviously, {{DESC FORMATTED}} has a bug. > Can not see table comment when describe formatted table > --- > > Key: SPARK-20023 > URL: https://issues.apache.org/jira/browse/SPARK-20023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: chenerlu > > Spark 2.x implements create table by itself. > https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7 > But the implementation mentioned above removes the table comment from the > table properties, so users cannot see the table comment by running "describe formatted > table". Similarly, when a user alters the table comment, the change is still not > visible when running "describe formatted table". > I wonder why we removed table comments; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935304#comment-15935304 ] Miao Wang commented on SPARK-19634: --- Comments never arrive in my email inbox. [~timhunter] I can continue with your code. Let me check it out. Thanks! > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208. Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20047) Constrained Logistic Regression
DB Tsai created SPARK-20047: --- Summary: Constrained Logistic Regression Key: SPARK-20047 URL: https://issues.apache.org/jira/browse/SPARK-20047 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.1.0 Reporter: DB Tsai Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935278#comment-15935278 ] Xiao Li commented on SPARK-20008: - See the discussion https://github.com/apache/spark/pull/12736#r61344182 The behavior of the previous EXCEPT is wrong. > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns > 1 > --- > > Key: SPARK-20008 > URL: https://issues.apache.org/jira/browse/SPARK-20008 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 >Reporter: Ravindra Bajpai > > hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields > 1 against an expected 0. > This was not the case with Spark 1.5.2. This is an API change from a usage > point of view, and hence I consider it a bug. May be a boundary case, not > sure. > Workaround: for now I check that the counts are != 0 before this operation. Not good > for performance. Hence creating a JIRA to track it. > As Young Zhang explained in reply to my mail: > Starting from Spark 2, these kinds of operations are implemented as a left anti > join, instead of using RDD operations directly. > Same issue also on sqlContext. > scala> spark.version > res25: String = 2.0.2 > spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) > == Physical Plan == > *HashAggregate(keys=[], functions=[], output=[]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[], output=[]) > +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false > :- Scan ExistingRDD[] > +- BroadcastExchange IdentityBroadcastMode > +- Scan ExistingRDD[] > This is arguably a bug. But my guess is that it relates to the logic of comparing NULL > = NULL (should it return true or false?), causing this kind of confusion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
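Young Zhang's explanation above (EXCEPT rewritten as a left anti join) can be illustrated without Spark. The sketch below is hypothetical pure Python, not Spark's implementation; it shows how three-valued NULL = NULL comparison versus null-safe equality changes what an anti join keeps:

```python
# Hypothetical sketch: EXCEPT as a left anti join, to show why NULL comparison
# semantics matter. None stands in for SQL NULL.

def sql_eq(a, b):
    """Three-valued SQL equality: returns None (unknown) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """SQL's <=> operator: NULL <=> NULL is True."""
    return a == b  # None == None is True in Python

def except_distinct(left, right, eq):
    """LEFT EXCEPT RIGHT as a left anti join: keep left rows with no match in right."""
    out = []
    for lrow in set(left):
        matched = any(all(eq(l, r) is True for l, r in zip(lrow, rrow))
                      for rrow in right)
        if not matched:
            out.append(lrow)
    return out

rows = [(1,), (None,)]
# Under three-valued equality, (None,) never "matches" (None,), so it survives EXCEPT:
print(except_distinct(rows, rows, sql_eq))        # [(None,)]
# Under null-safe equality the result is empty, as most users expect:
print(except_distinct(rows, rows, null_safe_eq))  # []
```

A NULL-containing row anti-joined against itself survives when the join condition evaluates to unknown rather than true, which is exactly the kind of confusion the comment describes.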
[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935264#comment-15935264 ] Xiao Li commented on SPARK-20009: - Are you suggesting to change the semantics of the parameter of the external API? We are unable to break the existing one. Maybe we can support both? Try to detect whether it is in the JSON format. If not, we can try to parse it as the DDL format? > Use user-friendly DDL formats for defining a schema in user-facing APIs > > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro > > In https://issues.apache.org/jira/browse/SPARK-19830, we add a new API in the > DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in existing some APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
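The detection idea in the comment above (try JSON first for backward compatibility, then fall back to the DDL format) can be sketched as follows. This is a hypothetical illustration; `parse_schema` and its naive DDL split are invented for this example, not Spark APIs:

```python
import json

def parse_schema(s):
    """Hypothetical sketch: accept both legacy JSON schema strings and
    user-friendly DDL strings like "a INT, b STRING"."""
    try:
        return json.loads(s)  # legacy JSON schema strings keep working
    except ValueError:
        pass
    # Naive DDL parse: "name TYPE, name TYPE, ..." (no nested types handled)
    fields = []
    for part in s.split(","):
        name, typ = part.strip().split(None, 1)
        fields.append({"name": name, "type": typ.lower(), "nullable": True})
    return {"type": "struct", "fields": fields}

print(parse_schema("a INT, b STRING"))
print(parse_schema('{"type": "struct", "fields": []}'))
```

A real implementation would delegate the fallback to the DDL parser added in SPARK-19830 rather than splitting strings by hand.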
[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935223#comment-15935223 ] Alex Bozarth commented on SPARK-20044: -- I like this idea in theory, but I'm worried it would take a large, sweeping code change to work. If you already have an implementation idea, I would suggest opening a PR. For me, accepting this would hinge on how it's implemented; I'd rather not add lots of new code across the entire web UI. [~vanzin] and [~tgraves], what do you think? You helped review the reverse proxy PR. > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Oliver Koeth >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow running the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows accessing multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows running the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host.
In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. > The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
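The proposal above amounts to prepending spark.ui.reverseProxyUrl to the existing proxy base whenever the UI generates a URL. A minimal sketch of that idea; the helper name and signature are invented for illustration, not Spark's actual UIUtils:

```python
def ui_link(path, proxy_base="", reverse_proxy_url=""):
    """Hypothetical sketch: build a UI URL honoring both the per-app proxy base
    (spark.ui.proxyBase) and a front-end reverse-proxy prefix with a context root."""
    prefix = reverse_proxy_url.rstrip("/") + proxy_base
    return prefix + path

# Without a front-end prefix, links are rooted at the virtual host:
print(ui_link("/executors/", proxy_base="/proxy/app-123"))
# With a context root, the same link works behind www.mydomain.com/cluster1:
print(ui_link("/executors/", "/proxy/app-123", "https://www.mydomain.com/cluster1"))
```

The point of the sketch is that the prefix must be applied at URL-generation time, in one place, rather than patched into HTML by the front-end proxy.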
[jira] [Assigned] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
[ https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20046: Assignee: Apache Spark > Facilitate loop optimizations in a JIT compiler regarding > sqlContext.read.parquet() > --- > > Key: SPARK-20046 > URL: https://issues.apache.org/jira/browse/SPARK-20046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > [This > article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] > suggests that better generated code can improve performance by facilitating > compiler optimizations. > This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} > to facilitate loop optimizations in a JIT compiler for achieving better > performance. In particular, [this stackoverflow > entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] > suggested that I improve the performance of > {{sqlContext.read.parquet("file").count}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
[ https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20046: Assignee: (was: Apache Spark) > Facilitate loop optimizations in a JIT compiler regarding > sqlContext.read.parquet() > --- > > Key: SPARK-20046 > URL: https://issues.apache.org/jira/browse/SPARK-20046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > [This > article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] > suggests that better generated code can improve performance by facilitating > compiler optimizations. > This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} > to facilitate loop optimizations in a JIT compiler for achieving better > performance. In particular, [this stackoverflow > entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] > suggested that I improve the performance of > {{sqlContext.read.parquet("file").count}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
[ https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935162#comment-15935162 ] Apache Spark commented on SPARK-20046: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17378 > Facilitate loop optimizations in a JIT compiler regarding > sqlContext.read.parquet() > --- > > Key: SPARK-20046 > URL: https://issues.apache.org/jira/browse/SPARK-20046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > [This > article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] > suggests that better generated code can improve performance by facilitating > compiler optimizations. > This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} > to facilitate loop optimizations in a JIT compiler for achieving better > performance. In particular, [this stackoverflow > entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] > suggested that I improve the performance of > {{sqlContext.read.parquet("file").count}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
[ https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20046: - Issue Type: Improvement (was: Bug) > Facilitate loop optimizations in a JIT compiler regarding > sqlContext.read.parquet() > --- > > Key: SPARK-20046 > URL: https://issues.apache.org/jira/browse/SPARK-20046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > [This > article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] > suggests that better generated code can improve performance by facilitating > compiler optimizations. > This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} > to facilitate loop optimizations in a JIT compiler for achieving better > performance. In particular, [this stackoverflow > entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] > suggested that I improve the performance of > {{sqlContext.read.parquet("file").count}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
Kazuaki Ishizaki created SPARK-20046: Summary: Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet() Key: SPARK-20046 URL: https://issues.apache.org/jira/browse/SPARK-20046 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki [This article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] suggests that better generated code can improve performance by facilitating compiler optimizations. This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} to facilitate loop optimizations in a JIT compiler for achieving better performance. In particular, [this stackoverflow entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] suggested that I improve the performance of {{sqlContext.read.parquet("file").count}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935112#comment-15935112 ] Seth Hendrickson commented on SPARK-17136: -- The reason to support setting them in both places would be backwards compatibility mainly. If we still allow users to set {{maxIter}} on the estimator then we won't break code that previously did this. Specifying the optimizer, either one built into Spark or a custom one, would be optional and something mostly advanced users would do. About grid-based CV, this would be a point that we need to carefully consider and make sure that we get it right. We'd still allow users to search over grids of {{maxIter}}, {{tol}} etc... since those params are still there, but additionally users could search over different optimizers and optimizers with different parameters themselves. I think that could be a bit clunky, but it's open for design discussion. e.g. {code} val paramGrid = new ParamGridBuilder() .addGrid(lr.minimizer, Array(new LBFGS(), new OWLQN(), new LBFGSB(lb, ub))) .build() {code} Yes, there are cases where users could supply conflicting grids, but AFAICT this problem already exists, e.g. {code} val paramGrid = new ParamGridBuilder() .addGrid(lr.solver, Array("normal", "l-bfgs")) .addGrid(lr.maxIter, Array(10, 20)) // maxIter is ignored when solver is normal .build() {code} About your suggestion of mimicking Spark SQL - would you mind elaborating here or on the design doc? I'm not as familiar with it, so if you have some design in mind it would be great to hear that. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20017) Functions "str_to_map" and "explode" throw NPE exception
[ https://issues.apache.org/jira/browse/SPARK-20017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20017: Labels: (was: correctness) > Functions "str_to_map" and "explode" throw NPE exception > -- > > Key: SPARK-20017 > URL: https://issues.apache.org/jira/browse/SPARK-20017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: roncenzhao >Assignee: roncenzhao > Fix For: 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > ``` > val sqlDf = spark.sql("select k,v from (select str_to_map('') as map_col from > range(2)) tbl lateral view explode(map_col) as k,v") > sqlDf.show > ``` > The SQL throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20017) Functions "str_to_map" and "explode" throw NPE exception
[ https://issues.apache.org/jira/browse/SPARK-20017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20017. - Resolution: Fixed Assignee: roncenzhao Fix Version/s: 2.2.0 2.1.1 > Functions "str_to_map" and "explode" throw NPE exception > -- > > Key: SPARK-20017 > URL: https://issues.apache.org/jira/browse/SPARK-20017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: roncenzhao >Assignee: roncenzhao > Labels: correctness > Fix For: 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > ``` > val sqlDf = spark.sql("select k,v from (select str_to_map('') as map_col from > range(2)) tbl lateral view explode(map_col) as k,v") > sqlDf.show > ``` > The SQL throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
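The NPE scenario above can be reproduced in miniature: under the usual Hive-style str_to_map semantics, an empty input string still yields one map entry whose value is NULL, so explode must tolerate NULL values. A pure-Python sketch of those assumed semantics, not Spark's actual code:

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Sketch of SQL str_to_map semantics (assumed): split into pairs, then into
    key/value. An empty input produces one entry with an empty key and NULL value."""
    out = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        out[key] = value if sep else None  # no key/value delimiter -> NULL value
    return out

def explode(m):
    """A NULL-tolerant explode; code that assumed non-null map values is the kind
    of thing that produced the NPE reported above."""
    return [(k, v) for k, v in m.items()]

print(str_to_map(""))                 # {'': None}
print(explode(str_to_map("")))        # [('', None)]
print(str_to_map("a:1,b:2"))          # {'a': '1', 'b': '2'}
```

The `{'': None}` result for empty input is the edge case the fix has to handle.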
[jira] [Commented] (SPARK-20016) SparkLauncher submit job failed after setConf with special characters under windows
[ https://issues.apache.org/jira/browse/SPARK-20016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935005#comment-15935005 ] Marcelo Vanzin commented on SPARK-20016: This was a long time ago and mostly trial & error, since Windows batch files make no sense. Since I don't really have a Windows test env anymore, I'd appreciate it if someone who does have one can try things out. > SparkLauncher submit job failed after setConf with special characters under > windows > -- > > Key: SPARK-20016 > URL: https://issues.apache.org/jira/browse/SPARK-20016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 7, 8, 10, 2008, 2008R2, etc. >Reporter: Vincent Sun > > I am using the SparkLauncher Java API to submit a job to a remote Spark cluster > master. The code looks like the following: > /* > * launch Job > */ > public static void launch() throws Exception { > SparkLauncher spark = new SparkLauncher(); > spark.setAppName("sparkdemo").setAppResource("hdfs://10.250.1.121:9000/application.jar").setMainClass("test.Application"); > spark.setMaster("spark://10.250.1.120:6066"); > spark.setDeployMode("cluster"); > spark.setConf("spark.executor.cores","2"); > spark.setConf("spark.executor.memory","8G"); > spark.startApplication(new MyAppListener(job.getAppName())); > } > It works fine under Linux/CentOS, but fails on my own desktop, which is a > Windows 8 OS. It throws this error: > [launcher-proc-1] The filename, directory name, or volume label syntax is > incorrect. > The final command I captured is this: > spark-submit.cmd --master spark://10.250.1.120:6066 --deploy-mode cluster > --name sparkdemo --conf "spark.executor.memory=8G" --conf > "spark.executor.cores=2" --class test.Application > hdfs://10.250.1.121:9000/application.jar > The quotes around spark.executor.memory=8G and spark.executor.cores=2 cause the > exception.
> After debugging the source code, I found the cause: the
> quoteForBatchScript method of the CommandBuilderUtils class.
> It adds quotes when there is an '=' or certain other special characters
> under Windows. Here is the source code:
> static String quoteForBatchScript(String arg) {
>   boolean needsQuotes = false;
>   for (int i = 0; i < arg.length(); i++) {
>     int c = arg.codePointAt(i);
>     if (Character.isWhitespace(c) || c == '"' || c == '=' || c == ',' || c == ';') {
>       needsQuotes = true;
>       break;
>     }
>   }
>   if (!needsQuotes) {
>     return arg;
>   }
>   StringBuilder quoted = new StringBuilder();
>   quoted.append("\"");
>   for (int i = 0; i < arg.length(); i++) {
>     int cp = arg.codePointAt(i);
>     switch (cp) {
>       case '"':
>         quoted.append('"');
>         break;
>       default:
>         break;
>     }
>     quoted.appendCodePoint(cp);
>   }
>   if (arg.codePointAt(arg.length() - 1) == '\\') {
>     quoted.append("\\");
>   }
>   quoted.append("\"");
>   return quoted.toString();
> }
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
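The quoting rule above is easier to see in isolation. Below is a Python port of the same logic (for illustration only; the actual implementation is the Java method quoted above), showing that any `--conf` value containing '=' comes out wrapped in double quotes:

```python
def quote_for_batch_script(arg):
    """Python port of CommandBuilderUtils.quoteForBatchScript, for
    illustration; the real implementation is the Java method above."""
    # Quote only if the argument contains whitespace or a special character.
    if not any(ch.isspace() or ch in '"=,;' for ch in arg):
        return arg
    quoted = '"'
    for ch in arg:
        if ch == '"':
            quoted += '"'  # batch scripts escape a quote by doubling it
        quoted += ch
    if arg.endswith('\\'):
        quoted += '\\'  # keep a trailing backslash from eating the close quote
    return quoted + '"'

print(quote_for_batch_script("spark.executor.memory=8G"))
# "spark.executor.memory=8G"  <- the '=' triggers the quoting
print(quote_for_batch_script("clusterMode"))
# clusterMode                 <- no special characters, left untouched
```

This matches the final command captured above, where both `--conf` values arrived quoted.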
[jira] [Updated] (SPARK-20039) Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20039: -- Priority: Minor (was: Major) > Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest > - > > Key: SPARK-20039 > URL: https://issues.apache.org/jira/browse/SPARK-20039 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.2.0 > > > I realized that since {{ChiSquare}} is in the package {{stat}}, it's pretty > unclear if it's the hypothesis test, distribution, or what. I plan to rename > it to {{ChiSquareTest}} to clarify this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20039) Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20039. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17368 [https://github.com/apache/spark/pull/17368] > Rename ml.stat.ChiSquare to ml.stat.ChiSquareTest > - > > Key: SPARK-20039 > URL: https://issues.apache.org/jira/browse/SPARK-20039 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.2.0 > > > I realized that since {{ChiSquare}} is in the package {{stat}}, it's pretty > unclear if it's the hypothesis test, distribution, or what. I plan to rename > it to {{ChiSquareTest}} to clarify this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934960#comment-15934960 ] Seth Hendrickson commented on SPARK-7129: - I don't think anyone is working on it. Though I'm afraid it is probably not a good use of time to spend on this task, for a couple of reasons. We still don't have weight support in trees and there is extremely limited bandwidth of reviewers/committers in Spark ML at the moment. Further, there are many more important tasks that need to be done in ML so I would rate this as low priority, which also means it is less likely to be reviewed or see much progress. Finally, given the recent success of things like xgboost/lightGBM, we may want to rethink/rewrite the existing boosting framework to see if we can get similar performance. If anything, I think we need to think about how we'd like to proceed improving the boosting libraries in Spark from an overall point of view, but that is a large task that is likely a few releases away. I'd be curious to hear others' thoughts of course, but this is the state of things AFAIK. I guess I don't see this as a priority, but it could become one given enough community interest. > Add generic boosting algorithm to spark.ml > -- > > Key: SPARK-7129 > URL: https://issues.apache.org/jira/browse/SPARK-7129 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Boosting algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of boosting which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > In particular, it will be important to think about supporting: > * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.) 
> * multiclass variants > * multilabel variants (which will probably be in a separate class and JIRA) > * For more esoteric variants, we should consider them but not design too much > around them: totally corrective boosting, cascaded models > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17121) Support _HOST replacement for principal
[ https://issues.apache.org/jira/browse/SPARK-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934908#comment-15934908 ] Chris Gianelloni commented on SPARK-17121: -- I find this useful when configuring the Spark History Server to write to a Kerberos-enabled HDFS. > Support _HOST replacement for principal > --- > > Key: SPARK-17121 > URL: https://issues.apache.org/jira/browse/SPARK-17121 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > _HOST is a placeholder for the host, widely used by Hadoop > components (NN/DN/RM/NM, etc.); this is useful for automatic > configuration by cluster deployment tools. It would be nice if Spark > could also support this; it would be especially useful for the Spark Thrift Server. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
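For reference, Hadoop's convention is to substitute the local host's fully qualified domain name for the `_HOST` token in a principal such as `spark/_HOST@EXAMPLE.COM`. A minimal sketch of the requested behavior (a hypothetical helper, not an existing Spark API; the lowercasing mirrors what Hadoop's SecurityUtil does, as far as I know):

```python
import socket

def expand_principal(principal, hostname=None):
    """Replace the _HOST placeholder with the (fully qualified) hostname.
    Hypothetical helper sketching the behavior requested in the ticket."""
    hostname = hostname or socket.getfqdn()
    return principal.replace("_HOST", hostname.lower())

print(expand_principal("spark/_HOST@EXAMPLE.COM", "Node1.example.com"))
# spark/node1.example.com@EXAMPLE.COM
```

With such a helper, the same principal template could be shipped in a shared config to every host in the cluster, which is the deployment-tool use case described above.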
[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934905#comment-15934905 ] Nick Pentreath commented on SPARK-20043: I just noticed the error message you put above says "Entorpy" - is that a spelling mistake in the JIRA description or in your code? > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > > I saved a CrossValidatorModel with a decision tree and a random forest. I used > a param grid to test "gini" and "entropy" impurity. CrossValidatorModel is not > able to load the saved model when the impurity is not written in lowercase. I > obtain an error from Spark: "impurity Gini (Entorpy) not recognized." -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20043: --- Docs Text: (was: I saved a CrossValidatorModel with a decision tree and a random forest. I use Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not able to load the saved model, when impurity are written not in lowercase. I obtain an error from Spark "impurity Gini (Entorpy) not recognized.) > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entorpy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20043: --- Description: I saved a CrossValidatorModel with a decision tree and a random forest. I use Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not able to load the saved model, when impurity are written not in lowercase. I obtain an error from Spark "impurity Gini (Entorpy) not recognized. > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entorpy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
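One way to make the loader tolerant of mixed case would be to normalize the impurity string before validating it, as in this sketch (a hypothetical helper for illustration, not Spark's actual loader code):

```python
SUPPORTED_IMPURITIES = ("gini", "entropy")

def normalize_impurity(value):
    """Accept 'Gini', 'ENTROPY', etc. by lowercasing before validation.
    Hypothetical helper, not the actual Spark ML loader."""
    canonical = value.lower()
    if canonical not in SUPPORTED_IMPURITIES:
        raise ValueError("impurity %s not recognized" % value)
    return canonical

print(normalize_impurity("Gini"))     # gini
print(normalize_impurity("ENTROPY"))  # entropy
```

Saving could then persist the canonical lowercase form, so a round trip through save/load would no longer depend on how the user capitalized the param.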
[jira] [Commented] (SPARK-19934) code comments are not very clear in BlackListTracker.scala
[ https://issues.apache.org/jira/browse/SPARK-19934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934869#comment-15934869 ] Imran Rashid commented on SPARK-19934: -- technically, you are right, that "another" isn't really correct, it depends on the configs ... but I think this is a pretty insignificant change. The comment is more about "why" than the exact logic, which is best described by the code anyhow. > code comments are not very clear in BlackListTracker.scala > > > Key: SPARK-19934 > URL: https://issues.apache.org/jira/browse/SPARK-19934 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Trivial > > {code} > def handleRemovedExecutor(executorId: String): Unit = { > // We intentionally do not clean up executors that are already > blacklisted in > // nodeToBlacklistedExecs, so that if another executor on the same node > gets blacklisted, we can > // blacklist the entire node. We also can't clean up > executorIdToBlacklistStatus, so we can > // eventually remove the executor after the timeout. Despite not > clearing those structures > // here, we don't expect they will grow too big since you won't get too > many executors on one > // node, and the timeout will clear it up periodically in any case. > executorIdToFailureList -= executorId > } > {code} > I think the comments should be: > {code} > // We intentionally do not clean up executors that are already blacklisted in > // nodeToBlacklistedExecs, so that if > {spark.blacklist.application.maxFailedExecutorsPerNode} - 1 executor on the > same node gets blacklisted, we can > // blacklist the entire node. > {code} > Reference from the design doc > https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit.
> When considering whether to place a node on the application-level blacklist, it should follow this rule: nodes are placed into a blacklist for the entire application when the number > of blacklisted executors goes over > spark.blacklist.application.maxFailedExecutorsPerNode (default 2), > and the comment only explains the default value -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
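The rule cited from the design doc can be sketched directly; `max_failed` stands in for `spark.blacklist.application.maxFailedExecutorsPerNode` (this is an illustrative sketch of the documented rule, not BlacklistTracker's actual code):

```python
DEFAULT_MAX_FAILED_EXECUTORS_PER_NODE = 2  # documented default

def should_blacklist_node(blacklisted_execs_on_node,
                          max_failed=DEFAULT_MAX_FAILED_EXECUTORS_PER_NODE):
    """A node joins the application-level blacklist once the number of
    blacklisted executors on it goes over the threshold (sketch only)."""
    return len(blacklisted_execs_on_node) > max_failed

print(should_blacklist_node({"exec-1", "exec-2"}))            # False
print(should_blacklist_node({"exec-1", "exec-2", "exec-3"}))  # True
```

This also illustrates why the word "another" in the original comment is only accurate for the default config: with `max_failed = 2`, keeping two already-blacklisted executors in `nodeToBlacklistedExecs` means one more blacklisted executor tips the node over.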
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934864#comment-15934864 ] Mohamed Baddar commented on SPARK-7129: --- [~josephkb] [~sethah] [~meihuawu] [~mlnick] If no one is working on this, can I start working on it? I have some experience contributing starter tasks in Spark. If no one is working on it, I would love to start reading the design docs mentioned in the comments and start discussing next steps. > Add generic boosting algorithm to spark.ml > -- > > Key: SPARK-7129 > URL: https://issues.apache.org/jira/browse/SPARK-7129 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Boosting algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of boosting which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > In particular, it will be important to think about supporting: > * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.) > * multiclass variants > * multilabel variants (which will probably be in a separate class and JIRA) > * For more esoteric variants, we should consider them but not design too much > around them: totally corrective boosting, cascaded models > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
[ https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19261. - Resolution: Fixed Assignee: Xin Wu Fix Version/s: 2.2.0 > Support `ALTER TABLE table_name ADD COLUMNS(..)` statement > -- > > Key: SPARK-19261 > URL: https://issues.apache.org/jira/browse/SPARK-19261 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: StanZhai >Assignee: Xin Wu > Fix For: 2.2.0 > > > We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which > was already supported in versions < 2.x. > This is very useful for those who want to upgrade their Spark version to 2.x. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20041) Update docs for NaN handling in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20041. - Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.2.0 > Update docs for NaN handling in approxQuantile > -- > > Key: SPARK-20041 > URL: https://issues.apache.org/jira/browse/SPARK-20041 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.2.0 > > > {{approxQuantile}} in R and Python now support multi-column, and the current > note about NaN handling is out of date: > {{Note that rows containing any null values will be removed before > calculation.}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736 ] Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:22 PM: -- [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can implement their own optimizer like Spark SQL external data source support. Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. Thanks. was (Author: yanboliang): [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation like Spark SQL external data source support. 
Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. Thanks. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736 ] Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:18 PM: -- [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation like Spark SQL external data source support. Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. Thanks. was (Author: yanboliang): [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation such as the data source support. 
Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. Thanks. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736 ] Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:17 PM: -- [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? Thanks. I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation such as the data source support. Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. was (Author: yanboliang): [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? Thanks. 
> Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736 ] Yanbo Liang edited comment on SPARK-17136 at 3/21/17 3:17 PM: -- [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation such as the data source support. Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. Thanks. was (Author: yanboliang): [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in optimizer, Do we still support setting these parameters in estimator again? If yes, why we need to support two entrances for the same set of params? I saw you reply at the design doc, you propose to make the params in optimizer superior to the ones in estimator. Does it involves confusion for users and extra maintenance cost? Does the grid search-based model selection in the current framework (such as CrossValidator) can still work well? Thanks. I'm more prefer to keep these params in estimators, make the optimizer layer as an internal API, and users can register their own optimizer implementation such as the data source support. 
Since I found this is more aligned with the original [ML pipeline design|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit#] which stores params outside a pipeline component. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19998) BlockRDD block not found Exception add RDD id info
[ https://issues.apache.org/jira/browse/SPARK-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19998: - Assignee: jianran.tfh > BlockRDD block not found Exception add RDD id info > -- > > Key: SPARK-19998 > URL: https://issues.apache.org/jira/browse/SPARK-19998 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.1.0 >Reporter: jianran.tfh >Assignee: jianran.tfh >Priority: Trivial > Fix For: 2.2.0 > > > "java.lang.Exception: Could not compute split, block $blockId not found" > doesn't include the RDD id; the "BlockManager: Removing RDD $id" log has only > the RDD id, so one cannot tell that the exception was caused by the removal. > It would be better for the block-not-found exception to include the RDD id. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19998) BlockRDD block not found Exception add RDD id info
[ https://issues.apache.org/jira/browse/SPARK-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19998. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17334 [https://github.com/apache/spark/pull/17334] > BlockRDD block not found Exception add RDD id info > -- > > Key: SPARK-19998 > URL: https://issues.apache.org/jira/browse/SPARK-19998 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.1.0 >Reporter: jianran.tfh >Assignee: jianran.tfh >Priority: Trivial > Fix For: 2.2.0 > > > "java.lang.Exception: Could not compute split, block $blockId not found" > doesn't include the RDD id; the "BlockManager: Removing RDD $id" log has only > the RDD id, so one cannot tell that the exception was caused by the removal. > It would be better for the block-not-found exception to include the RDD id. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
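The improvement amounts to carrying the RDD id into the error text so the failure can be matched against the removal log line; a sketch of such a message (illustrative only, not necessarily the exact wording of the merged patch):

```python
def block_not_found_message(block_id, rdd_id):
    """Include the RDD id so the exception can be correlated with the
    'BlockManager: Removing RDD $id' log line (illustrative sketch)."""
    return "Could not compute split, block %s of RDD %s not found" % (
        block_id, rdd_id)

print(block_not_found_message("rdd_42_3", 42))
# Could not compute split, block rdd_42_3 of RDD 42 not found
```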
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934736#comment-15934736 ] Yanbo Liang commented on SPARK-17136: - [~sethah] Thanks for the design doc. One quick question: In your design, if we set the parameters in the optimizer, do we still support setting these parameters in the estimator as well? If yes, why do we need to support two entrances for the same set of params? I saw your reply in the design doc; you propose to make the params in the optimizer take precedence over the ones in the estimator. Does that involve confusion for users and extra maintenance cost? Can the grid search-based model selection in the current framework (such as CrossValidator) still work well? Thanks. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source
[ https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934719#comment-15934719 ] Jason White commented on SPARK-19950: - Without something that allows us to read using the nullable as it exists on disk, we end up doing: df = spark.read.parquet(path) return spark.createDataFrame(df.rdd, schema) Which is obviously not desirable. We would much rather rely on the schema as defined by the file format (Parquet in our case), or rely on a user-supplied schema. Preferably both. > nullable ignored when df.load() is executed for file-based data source > -- > > Key: SPARK-19950 > URL: https://issues.apache.org/jira/browse/SPARK-19950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > This problem is reported in [Databricks > forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html]. > When we execute the following code, the schema for "id" in {{dfRead}} has > {{nullable = true}}. It should be {{nullable = false}}. > {code:java} > val field = "id" > val df = spark.range(0, 5, 1, 1).toDF(field) > val fmt = "parquet" > val path = "/tmp/parquet" > val schema = StructType(Seq(StructField(field, LongType, false))) > df.write.format(fmt).mode("overwrite").save(path) > val dfRead = spark.read.format(fmt).schema(schema).load(path) > dfRead.printSchema > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19949) unify bad record handling in CSV and JSON
[ https://issues.apache.org/jira/browse/SPARK-19949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934710#comment-15934710 ]

Apache Spark commented on SPARK-19949:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17377

> unify bad record handling in CSV and JSON
> -----------------------------------------
>
>                 Key: SPARK-19949
>                 URL: https://issues.apache.org/jira/browse/SPARK-19949
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.2.0
>
[jira] [Assigned] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang reassigned SPARK-12664:
-----------------------------------

    Assignee: Weichen Xu  (was: Yanbo Liang)

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12664
>                 URL: https://issues.apache.org/jira/browse/SPARK-12664
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Robert Dodier
>            Assignee: Weichen Xu
>
> In
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel,
> there isn't any way to get raw prediction scores; only an integer output
> (from 0 to #classes - 1) is available via the `predict` method.
> `mlpModel.predict` is called within the class to get the raw score, but
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier
> output as a probability.
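For context on why exposing the raw scores matters: if the raw output vector were available to callers, turning it into class probabilities is typically a softmax (and the argmax of the result matches the integer `predict` output). A minimal plain-Python sketch, assuming the raw scores are simply a list of floats rather than a Spark vector:

```python
import math


def softmax(raw_scores):
    # Numerically stable softmax: subtract the max before exponentiating
    # so large scores cannot overflow math.exp.
    m = max(raw_scores)
    exps = [math.exp(s - m) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


raw = [2.0, 1.0, 0.1]          # hypothetical raw scores for a 3-class model
probs = softmax(raw)           # per-class probabilities summing to 1
pred = probs.index(max(probs))  # the integer label that `predict` would return
```

The point of the ticket is that only `pred` is reachable today; `probs` (or the raw scores feeding it) is what users want for probability-style interpretation.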
[jira] [Assigned] (SPARK-20041) Update docs for NaN handling in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20041:
------------------------------------

    Assignee: Apache Spark

> Update docs for NaN handling in approxQuantile
> ----------------------------------------------
>
>                 Key: SPARK-20041
>                 URL: https://issues.apache.org/jira/browse/SPARK-20041
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SparkR
>    Affects Versions: 2.2.0
>            Reporter: zhengruifeng
>            Assignee: Apache Spark
>            Priority: Trivial
>
> {{approxQuantile}} in R and Python now supports multi-column input, and the current
> note about NaN handling is out of date:
> {{Note that rows containing any null values will be removed before
> calculation.}}
[jira] [Created] (SPARK-20041) Update docs for NaN handling in approxQuantile
zhengruifeng created SPARK-20041:
------------------------------------

             Summary: Update docs for NaN handling in approxQuantile
                 Key: SPARK-20041
                 URL: https://issues.apache.org/jira/browse/SPARK-20041
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SparkR
    Affects Versions: 2.2.0
            Reporter: zhengruifeng
            Priority: Trivial

{{approxQuantile}} in R and Python now supports multi-column input, and the current note about NaN handling is out of date:
{{Note that rows containing any null values will be removed before calculation.}}
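The doc fix above is about behavior, not code: NaN values are excluded from the quantile computation per column, rather than whole rows being dropped. A naive pure-Python sketch of that NaN handling (an exact order-statistic pick, nothing like the Greenwald–Khanna approximation Spark's `approxQuantile` actually uses):

```python
import math


def quantile_ignoring_nan(values, q):
    # Drop NaN values from this one column only (rows are not removed
    # wholesale), then pick the order statistic at the requested fraction.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return float("nan")
    idx = min(int(q * len(clean)), len(clean) - 1)
    return clean[idx]


col = [1.0, float("nan"), 2.0, 3.0, 4.0]
median = quantile_ignoring_nan(col, 0.5)  # computed over [1.0, 2.0, 3.0, 4.0]
```

A NaN in one column therefore has no effect on quantiles computed for other columns, which is why the old "rows containing any null values will be removed" wording is misleading for the multi-column API.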