[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16043: Assignee: (was: Apache Spark) > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing, reducing the runtime memory > footprint and improving performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
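For context on the boxing overhead being described, here is a minimal, self-contained Scala sketch of the idea. The class name {{IntArrayData}} and its methods are hypothetical stand-ins, not the actual Catalyst classes; a real implementation would live behind the {{ArrayData}} interface.

{code}
object PrimitiveArraySketch extends App {
  // Boxed representation (roughly what a generic Array[Any] forces today):
  // every Int element is wrapped in a java.lang.Integer on the heap.
  val boxed: Array[Any] = Array[Any](1, 2, 3)

  // Hypothetical specialized representation: a primitive Array[Int]
  // with direct, box-free element access.
  final class IntArrayData(values: Array[Int]) {
    def numElements(): Int = values.length
    def getInt(ordinal: Int): Int = values(ordinal) // no boxing/unboxing
    def toIntArray(): Array[Int] = values.clone()
  }

  val specialized = new IntArrayData(Array(1, 2, 3))
  println(specialized.getInt(1)) // prints 2, read as a primitive
}
{code}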
[jira] [Created] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.
Hyukjin Kwon created SPARK-16044: Summary: input_file_name() returns empty strings in data sources based on NewHadoopRDD. Key: SPARK-16044 URL: https://issues.apache.org/jira/browse/SPARK-16044 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon The issue is that the {{input_file_name()}} function does not return file paths when data sources use {{NewHadoopRDD}}; it is currently only supported for {{FileScanRDD}} and {{HadoopRDD}}. To be clear, this does not affect Spark's internal data sources, because none of them currently use {{NewHadoopRDD}}. However, several external data sources do use it, for example: spark-redshift - [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149] spark-xml - [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47] Currently, using this function shows the output below: {code} +-----------------+ |input_file_name()| +-----------------+ | | | | | | | | | | | | | | | | | | | | | | +-----------------+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
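For readers unfamiliar with the function, this is roughly how it is used from Scala (a minimal sketch; the JSON path is the sample file shipped in the Spark repo and may need adjusting). File-based sources backed by {{FileScanRDD}} show each row's source path, whereas sources built on {{NewHadoopRDD}} hit the empty-string behavior reported above.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object InputFileNameExample extends App {
  val spark = SparkSession.builder()
    .appName("input-file-name-demo")
    .master("local[*]")
    .getOrCreate()

  // Built-in JSON source: input_file_name() returns the path of the file each row came from.
  val df = spark.read.json("examples/src/main/resources/people.json")
  df.select(input_file_name()).show(truncate = false)

  spark.stop()
}
{code}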
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337617#comment-15337617 ] Apache Spark commented on SPARK-16043: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/13758 > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing, reducing the runtime memory > footprint and improving performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16043: Assignee: Apache Spark > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing, reducing the runtime memory > footprint and improving performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16022) Input size is different when I use 1 or 3 nodes but the shuffle size remains +- equal, do you know why?
[ https://issues.apache.org/jira/browse/SPARK-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337615#comment-15337615 ] Sean Owen commented on SPARK-16022: --- This question belongs on the u...@spark.apache.org mailing list: http://spark.apache.org/community.html > Input size is different when I use 1 or 3 nodes but the shuffle size remains > +- equal, do you know why? > -- > > Key: SPARK-16022 > URL: https://issues.apache.org/jira/browse/SPARK-16022 > Project: Spark > Issue Type: Test >Reporter: jon > > I run some queries on Spark with just one node and then with 3 nodes, and in > the Spark :4040 UI I see something that I do not understand. > For example, after executing a query with 3 nodes and checking the results in the > Spark UI, the "Input" tab shows 2.8 GB, so Spark read 2.8 GB from Hadoop. > The same query with just one node in local mode shows 7.3 GB, so Spark read > 7.3 GB from Hadoop. Shouldn't these values be equal? > The shuffle size, for example, stays roughly equal with one node vs. 3. Why doesn't the > input value stay the same? The same amount of data should be read from > HDFS, so I do not understand it. > Do you know why? > Single node: > Input: 7.3 GB > Shuffle read: 208.1 KB > Shuffle write: 208.1 KB > 3 nodes: > Input: 2.8 GB > Shuffle read: 193.3 KB > Shuffle write: 208.1 KB -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16040) spark.mllib PIC document extra line of reference
[ https://issues.apache.org/jira/browse/SPARK-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16040: -- Priority: Trivial (was: Minor) OK, this does not need a JIRA > spark.mllib PIC document extra line of reference > > > Key: SPARK-16040 > URL: https://issues.apache.org/jira/browse/SPARK-16040 > Project: Spark > Issue Type: Documentation >Reporter: Miao Wang >Priority: Trivial > > In the 2.0 documentation, the line "A full example that produces the experiment > described in the PIC paper can be found under examples/." is redundant. > There is already "Find full example code at > "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" > in the Spark repo.". > We should remove the first line, to be consistent with the other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
Kazuaki Ishizaki created SPARK-16043: Summary: Prepare GenericArrayData implementation specialized for a primitive array Key: SPARK-16043 URL: https://issues.apache.org/jira/browse/SPARK-16043 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kazuaki Ishizaki There is a TODO in the GenericArrayData class to eliminate boxing/unboxing for a primitive array (described [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). It would be good to prepare a GenericArrayData implementation specialized for a primitive array to eliminate boxing/unboxing, reducing the runtime memory footprint and improving performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15973. - Resolution: Fixed Fix Version/s: 2.0.0 > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python doctest > comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably also fix the other places this happens in > PySpark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, which is exactly the same name that > {{pyspark.sql.DataFrame}} uses for its Java object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16025) Document OFF_HEAP storage level in 2.0
[ https://issues.apache.org/jira/browse/SPARK-16025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16025: -- Priority: Minor (was: Major) > Document OFF_HEAP storage level in 2.0 > -- > > Key: SPARK-16025 > URL: https://issues.apache.org/jira/browse/SPARK-16025 > Project: Spark > Issue Type: Documentation >Reporter: Eric Liang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16023) Move InMemoryRelation to its own file
[ https://issues.apache.org/jira/browse/SPARK-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16023: -- Issue Type: Improvement (was: Bug) > Move InMemoryRelation to its own file > - > > Key: SPARK-16023 > URL: https://issues.apache.org/jira/browse/SPARK-16023 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Just to make InMemoryTableScanExec a little smaller and more readable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16023) Move InMemoryRelation to its own file
[ https://issues.apache.org/jira/browse/SPARK-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16023. - Resolution: Fixed Fix Version/s: 2.0.0 > Move InMemoryRelation to its own file > - > > Key: SPARK-16023 > URL: https://issues.apache.org/jira/browse/SPARK-16023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Just to make InMemoryTableScanExec a little smaller and more readable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16042: Assignee: Apache Spark > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > When we run a Spark program with a projection over an array type, a null check > is generated at the call that writes each element of the array. If we know at > compilation time that none of the elements can be {{null}}, we can eliminate the > null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16042: Assignee: (was: Apache Spark) > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check > is generated at the call that writes each element of the array. If we know at > compilation time that none of the elements can be {{null}}, we can eliminate the > null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337594#comment-15337594 ] Apache Spark commented on SPARK-16042: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/13757 > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check > is generated at the call that writes each element of the array. If we know at > compilation time that none of the elements can be {{null}}, we can eliminate the > null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16042) Eliminate nullcheck code at projection for an array type
Kazuaki Ishizaki created SPARK-16042: Summary: Eliminate nullcheck code at projection for an array type Key: SPARK-16042 URL: https://issues.apache.org/jira/browse/SPARK-16042 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kazuaki Ishizaki When we run a Spark program with a projection over an array type, a null check is generated at the call that writes each element of the array. If we know at compilation time that none of the elements can be {{null}}, we can eliminate the null-check code. {code} val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") df.selectExpr("Array(v + 2.2, v + 3.3)").collect {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
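To make the optimization concrete, here is a hedged sketch (plain Scala, not the actual generated Java) of how the element writer could branch on {{ArrayType.containsNull}}, which is statically false for an expression like {{Array(v + 2.2, v + 3.3)}} over a non-nullable double column:

{code}
import org.apache.spark.sql.types.{ArrayType, DoubleType}

object NullCheckEliminationSketch extends App {
  // Array(v + 2.2, v + 3.3) over a non-nullable column yields an array type
  // whose elements can never be null, tracked via ArrayType.containsNull.
  val arrayType = ArrayType(DoubleType, containsNull = false)

  // Sketch of the decision the code generator could make: only emit the
  // per-element null check when the element type admits nulls.
  def elementWriterCode(t: ArrayType): String =
    if (t.containsNull) {
      "if (arr.isNullAt(i)) { writer.setNull(i); } else { writer.write(i, arr.getDouble(i)); }"
    } else {
      "writer.write(i, arr.getDouble(i)); // null check eliminated"
    }

  println(elementWriterCode(arrayType))
}
{code}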
[jira] [Updated] (SPARK-15803) Support with statement syntax for SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15803: --- Assignee: Jeff Zhang > Support with statement syntax for SparkSession > -- > > Key: SPARK-15803 > URL: https://issues.apache.org/jira/browse/SPARK-15803 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.0.0 > > > It would be nice to support the {{with}} statement syntax for SparkSession, like the > following: > {code} > with SparkSession.builder.(...).getOrCreate() as session: > session.sql("show tables").show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15803) Support with statement syntax for SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15803. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13541 [https://github.com/apache/spark/pull/13541] > Support with statement syntax for SparkSession > -- > > Key: SPARK-15803 > URL: https://issues.apache.org/jira/browse/SPARK-15803 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > Fix For: 2.0.0 > > > It would be nice to support the {{with}} statement syntax for SparkSession, like the > following: > {code} > with SparkSession.builder.(...).getOrCreate() as session: > session.sql("show tables").show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16035: -- Assignee: Andrea Pasqua > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Assignee: Andrea Pasqua >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16035. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 13750 [https://github.com/apache/spark/pull/13750] > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > Fix For: 2.0.0, 1.6.2 > > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16041: Assignee: Apache Spark > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16041: Assignee: (was: Apache Spark) > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337552#comment-15337552 ] Apache Spark commented on SPARK-16041: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13756 > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Description: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in . The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. was: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy`. The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in . > The duplicate columns could cause unpredictable results. For example, the > resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Description: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in DataFrameWriter. The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. was: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in . The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy`
Xiao Li created SPARK-16041: --- Summary: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` Key: SPARK-16041 URL: https://issues.apache.org/jira/browse/SPARK-16041 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy`. The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
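A minimal sketch of the kind of duplicate-name validation being proposed (plain Scala, not the actual DataFrameWriter code; `blockBy` in the title presumably refers to `bucketBy`):

{code}
object DuplicateColumnCheckSketch extends App {
  // Reject duplicate column names passed to partitionBy/bucketBy/sortBy
  // before any data is written, with a message naming the offending columns.
  def assertNoDuplicates(method: String, columns: Seq[String]): Unit = {
    val duplicates = columns.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }
    require(duplicates.isEmpty,
      s"Found duplicate column(s) in $method: ${duplicates.mkString(", ")}")
  }

  assertNoDuplicates("partitionBy", Seq("year", "month"))         // passes
  assertNoDuplicates("partitionBy", Seq("year", "year", "month")) // throws IllegalArgumentException
}
{code}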
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Summary: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter (was: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` ) > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in > DataFrameWriter > -- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy`. The > duplicate columns could cause unpredictable results. For example, the > resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16040) spark.mllib PIC document extra line of reference
[ https://issues.apache.org/jira/browse/SPARK-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16040: Assignee: (was: Apache Spark) > spark.mllib PIC document extra line of reference > > > Key: SPARK-16040 > URL: https://issues.apache.org/jira/browse/SPARK-16040 > Project: Spark > Issue Type: Documentation >Reporter: Miao Wang >Priority: Minor > > In the 2.0 documentation, the line "A full example that produces the experiment > described in the PIC paper can be found under examples/." is redundant. > There is already "Find full example code at > "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" > in the Spark repo.". > We should remove the first line, to be consistent with the other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16040) spark.mllib PIC document extra line of reference
[ https://issues.apache.org/jira/browse/SPARK-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16040: Assignee: Apache Spark > spark.mllib PIC document extra line of reference > > > Key: SPARK-16040 > URL: https://issues.apache.org/jira/browse/SPARK-16040 > Project: Spark > Issue Type: Documentation >Reporter: Miao Wang >Assignee: Apache Spark >Priority: Minor > > In the 2.0 documentation, the line "A full example that produces the experiment > described in the PIC paper can be found under examples/." is redundant. > There is already "Find full example code at > "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" > in the Spark repo.". > We should remove the first line, to be consistent with the other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16040) spark.mllib PIC document extra line of reference
[ https://issues.apache.org/jira/browse/SPARK-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337550#comment-15337550 ] Apache Spark commented on SPARK-16040: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/13755 > spark.mllib PIC document extra line of reference > > > Key: SPARK-16040 > URL: https://issues.apache.org/jira/browse/SPARK-16040 > Project: Spark > Issue Type: Documentation >Reporter: Miao Wang >Priority: Minor > > In the 2.0 documentation, the line "A full example that produces the experiment > described in the PIC paper can be found under examples/." is redundant. > There is already "Find full example code at > "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" > in the Spark repo.". > We should remove the first line, to be consistent with the other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16040) spark.mllib PIC document extra line of reference
Miao Wang created SPARK-16040: - Summary: spark.mllib PIC document extra line of reference Key: SPARK-16040 URL: https://issues.apache.org/jira/browse/SPARK-16040 Project: Spark Issue Type: Documentation Reporter: Miao Wang Priority: Minor In the 2.0 documentation, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant. There is already "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.". We should remove the first line, to be consistent with the other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16020) Fix complete mode aggregation with console sink
[ https://issues.apache.org/jira/browse/SPARK-16020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-16020. -- Resolution: Fixed Fix Version/s: 2.0.0 > Fix complete mode aggregation with console sink > --- > > Key: SPARK-16020 > URL: https://issues.apache.org/jira/browse/SPARK-16020 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Complete mode aggregation doesn't work with console sink. ConsoleSink just > shows the new data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
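For reference, a minimal structured-streaming example that exercises the fixed path (a sketch based on the standard socket word-count pattern; run {{nc -lk 9999}} locally to feed it). With {{outputMode("complete")}} the console sink should print the whole updated result table on each trigger rather than only the newly arrived rows:

{code}
import org.apache.spark.sql.SparkSession

object ConsoleCompleteModeExample extends App {
  val spark = SparkSession.builder()
    .appName("console-complete-mode")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Lines arriving on a local socket, turned into a running word count.
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  val wordCounts = lines.as[String]
    .flatMap(_.split(" "))
    .groupBy("value")
    .count()

  // Complete mode: the console sink prints the full aggregated table every trigger.
  val query = wordCounts.writeStream
    .outputMode("complete")
    .format("console")
    .start()

  query.awaitTermination()
}
{code}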
[jira] [Assigned] (SPARK-16037) use by-position resolution when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16037: Assignee: Apache Spark (was: Wenchen Fan) > use by-position resolution when insert into hive table > -- > > Key: SPARK-16037 > URL: https://issues.apache.org/jira/browse/SPARK-16037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > > INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b; > The result is 1, 3, 2 for hive table, which is wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
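A spark-shell style sketch of the reported behavior (the table and column names are illustrative assumptions; {{spark}} is a Hive-enabled SparkSession):

{code}
spark.sql("CREATE TABLE src (a INT, b INT, c INT)")

// By-position resolution maps the SELECT list onto (a, b, c) in order,
// regardless of the aliases, so the stored row should be (1, 2, 3).
spark.sql("INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b")

// The reported bug: for Hive tables the result was (1, 3, 2), which is
// consistent with the aliases being matched to columns by name instead.
spark.sql("SELECT * FROM src").show()
{code}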
[jira] [Assigned] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
[ https://issues.apache.org/jira/browse/SPARK-16036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16036: Assignee: Wenchen Fan (was: Apache Spark) > better error message if the number of columns in SELECT clause doesn't match > the table schema > - > > Key: SPARK-16036 > URL: https://issues.apache.org/jira/browse/SPARK-16036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
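A hedged sketch of the analysis-time check this asks for (plain Scala, not the actual analyzer code; the column layout of {{src}} is an assumption for illustration):

{code}
object ColumnCountCheckSketch extends App {
  // With static partition values such as PARTITION(b=2, c=3), only the
  // remaining columns may come from the SELECT clause; a mismatch should
  // fail with a message that explains the expected column count.
  def checkColumnCount(tableColumns: Seq[String],
                       staticPartitions: Set[String],
                       selectedExprs: Int): Unit = {
    val expected = tableColumns.count(c => !staticPartitions.contains(c))
    require(selectedExprs == expected,
      s"Cannot insert: the SELECT clause provides $selectedExprs column(s) but " +
        s"$expected are expected (table has ${tableColumns.size} columns, " +
        s"${staticPartitions.size} already filled by static partition values)")
  }

  // INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6  -- one column too many
  checkColumnCount(tableColumns = Seq("a", "b", "c"),
                   staticPartitions = Set("b", "c"),
                   selectedExprs = 3)
}
{code}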
[jira] [Assigned] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
[ https://issues.apache.org/jira/browse/SPARK-16036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16036: Assignee: Apache Spark (was: Wenchen Fan) > better error message if the number of columns in SELECT clause doesn't match > the table schema > - > > Key: SPARK-16036 > URL: https://issues.apache.org/jira/browse/SPARK-16036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > > INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16037) use by-position resolution when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337543#comment-15337543 ] Apache Spark commented on SPARK-16037: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13754 > use by-position resolution when insert into hive table > -- > > Key: SPARK-16037 > URL: https://issues.apache.org/jira/browse/SPARK-16037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b; > The result is 1, 3, 2 for hive table, which is wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16037) use by-position resolution when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16037: Assignee: Wenchen Fan (was: Apache Spark) > use by-position resolution when insert into hive table > -- > > Key: SPARK-16037 > URL: https://issues.apache.org/jira/browse/SPARK-16037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b; > The result is 1, 3, 2 for hive table, which is wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
[ https://issues.apache.org/jira/browse/SPARK-16036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337542#comment-15337542 ] Apache Spark commented on SPARK-16036: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13754 > better error message if the number of columns in SELECT clause doesn't match > the table schema > - > > Key: SPARK-16036 > URL: https://issues.apache.org/jira/browse/SPARK-16036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16029) Deprecate dropTempTable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16029: Assignee: Apache Spark > Deprecate dropTempTable in SparkR > - > > Key: SPARK-16029 > URL: https://issues.apache.org/jira/browse/SPARK-16029 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > This should be called dropTempTable to match the new Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16029) Deprecate dropTempTable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16029: Assignee: (was: Apache Spark) > Deprecate dropTempTable in SparkR > - > > Key: SPARK-16029 > URL: https://issues.apache.org/jira/browse/SPARK-16029 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > This should be called dropTempTable to match the new Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16029) Deprecate dropTempTable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337536#comment-15337536 ] Apache Spark commented on SPARK-16029: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13753 > Deprecate dropTempTable in SparkR > - > > Key: SPARK-16029 > URL: https://issues.apache.org/jira/browse/SPARK-16029 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > This should be called dropTempTable to match the new Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-16038) we can omit partition list when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan deleted SPARK-16038: > we can omit partition list when insert into hive table > -- > > Key: SPARK-16038 > URL: https://issues.apache.org/jira/browse/SPARK-16038 > Project: Spark > Issue Type: Sub-task >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16028) Remove the need to pass in a SparkContext for spark.lapply
[ https://issues.apache.org/jira/browse/SPARK-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16028: Assignee: Apache Spark > Remove the need to pass in a SparkContext for spark.lapply > --- > > Key: SPARK-16028 > URL: https://issues.apache.org/jira/browse/SPARK-16028 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > Similar to https://github.com/apache/spark/pull/9192 and SPARK-10903 we > should remove the need to pass in SparkContext to `spark.lapply` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16028) Remove the need to pass in a SparkContext for spark.lapply
[ https://issues.apache.org/jira/browse/SPARK-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16028: Assignee: (was: Apache Spark) > Remove the need to pass in a SparkContext for spark.lapply > --- > > Key: SPARK-16028 > URL: https://issues.apache.org/jira/browse/SPARK-16028 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > Similar to https://github.com/apache/spark/pull/9192 and SPARK-10903 we > should remove the need to pass in SparkContext to `spark.lapply` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16028) Remove the need to pass in a SparkContext for spark.lapply
[ https://issues.apache.org/jira/browse/SPARK-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337534#comment-15337534 ] Apache Spark commented on SPARK-16028: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13752 > Remove the need to pass in a SparkContext for spark.lapply > --- > > Key: SPARK-16028 > URL: https://issues.apache.org/jira/browse/SPARK-16028 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > Similar to https://github.com/apache/spark/pull/9192 and SPARK-10903 we > should remove the need to pass in SparkContext to `spark.lapply` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15159) SparkSession R API
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337532#comment-15337532 ] Apache Spark commented on SPARK-15159: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13751 > SparkSession R API > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Assignee: Felix Cheung >Priority: Blocker > Fix For: 2.0.0 > > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.builder.enableHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9857) Add expression functions into SparkR which conflict with the existing R's generic
[ https://issues.apache.org/jira/browse/SPARK-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337526#comment-15337526 ] Shivaram Venkataraman commented on SPARK-9857: -- [~yuu.ishik...@gmail.com] [~sunrui] Do we know what other functions fall into this category ? I'm trying to see if this work is done or if we have missed something here etc. > Add expression functions into SparkR which conflict with the existing R's > generic > - > > Key: SPARK-9857 > URL: https://issues.apache.org/jira/browse/SPARK-9857 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > > Add expression functions into SparkR which conflict with the existing R's > generic, like {{coalesce(e: Column*)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15124) R 2.0 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-15124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337523#comment-15337523 ] Shivaram Venkataraman commented on SPARK-15124: --- One more item on this list is the SparkSession change we merged recently. We'll need to update the examples and programming guide to reflect this cc [~dongjoon] > R 2.0 QA: New R APIs and API docs > - > > Key: SPARK-15124 > URL: https://issues.apache.org/jira/browse/SPARK-15124 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337525#comment-15337525 ] Shivaram Venkataraman commented on SPARK-6817: -- I think all the ones we need for 2.0 are completed here. [~srowen] Is there a clean way to mark the umbrella as complete for 2.0 and retarget the remaining for 2.1 ? > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15159) SparkSession R API
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-15159. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13635 [https://github.com/apache/spark/pull/13635] > SparkSession R API > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Priority: Blocker > Fix For: 2.0.0 > > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.builder.enableHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15159) SparkSession R API
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-15159: -- Assignee: Felix Cheung > SparkSession R API > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Assignee: Felix Cheung >Priority: Blocker > Fix For: 2.0.0 > > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.builder.enableHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15946) Wrap the conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15946. - Resolution: Fixed Fix Version/s: 2.0.0 > Wrap the conversion utils in Python > --- > > Key: SPARK-15946 > URL: https://issues.apache.org/jira/browse/SPARK-15946 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > This is to wrap SPARK-15945 in Python. So Python users can use it to convert > DataFrames with vector columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
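For reference, a sketch of the Scala-side utility (from SPARK-15945) that this Python wrapper exposes, applied to a DataFrame holding old spark.mllib vectors; the exact helper name and signature are recalled from memory, so treat them as an assumption:

{code}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object VectorConversionSketch extends App {
  val spark = SparkSession.builder()
    .appName("vector-conversion")
    .master("local[*]")
    .getOrCreate()

  // A DataFrame holding old-style (spark.mllib) vectors.
  val df = spark.createDataFrame(Seq(
    (0, OldVectors.dense(1.0, 2.0)),
    (1, OldVectors.sparse(2, Array(0), Array(3.0)))
  )).toDF("id", "features")

  // Convert the "features" column to new (spark.ml) vectors.
  val converted = MLUtils.convertVectorColumnsToML(df, "features")
  converted.printSchema()

  spark.stop()
}
{code}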
[jira] [Resolved] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R
[ https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15129. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13285 [https://github.com/apache/spark/pull/13285] > Clarify conventions for calling Spark and MLlib from R > -- > > Key: SPARK-15129 > URL: https://issues.apache.org/jira/browse/SPARK-15129 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, SparkR >Reporter: Joseph K. Bradley >Assignee: Gayathri Murali >Priority: Blocker > Fix For: 2.0.0 > > > Since some R API modifications happened in 2.0, we need to make the new > standards clear in the user guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15892: -- Fix Version/s: 2.0.0 > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > Fix For: 1.6.2, 2.0.0 > > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15892. --- Resolution: Fixed Fix Version/s: (was: 2.0.0) 1.6.2 Issue resolved by pull request 13725 [https://github.com/apache/spark/pull/13725] > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > Fix For: 1.6.2 > > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
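A simplified sketch of the direction of the fix, assuming (as the error message suggests) that the problem is merging in an aggregator that saw zero instances; this is illustrative plain Scala, not the actual {{AFTAggregator}} code:

{code}
object EmptyPartitionMergeSketch extends App {
  // Stand-in for the aggregator state: an instance count and a running loss sum.
  final case class Agg(count: Long, lossSum: Double) {
    // Skip partitions whose aggregator saw no instances, so an empty
    // partition cannot corrupt the merged state.
    def merge(other: Agg): Agg =
      if (other.count == 0L) this
      else Agg(count + other.count, lossSum + other.lossSum)
  }

  val merged = Agg(10, 3.5).merge(Agg(0, 0.0)).merge(Agg(5, 1.5))
  println(merged) // Agg(15,5.0): the empty aggregator is ignored
}
{code}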
[jira] [Resolved] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15603. --- Resolution: Fixed Fix Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in `ML/MLLib` module except the following two classes. These > two classes use `SQLContext` as their function arguments. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
[ https://issues.apache.org/jira/browse/SPARK-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16033. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13747 [https://github.com/apache/spark/pull/13747] > DataFrameWriter.partitionBy() can't be used together with > DataFrameWriter.insertInto() > -- > > Key: SPARK-16033 > URL: https://issues.apache.org/jira/browse/SPARK-16033 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > When inserting into an existing partitioned table, partitioning columns > should always be determined by catalog metadata of the existing table to be > inserted. Extra {{partitionBy()}} calls don't make sense, and mess up > existing data because newly inserted data may have wrong partitioning > directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
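To illustrate the conflict described in SPARK-16033, here is a minimal sketch; the table name {{events}} and its partitioning by {{dt}} are hypothetical and not taken from the ticket. When the target table already exists, its partition columns should come from the catalog metadata rather than from an extra {{partitionBy()}} call.

{code}
// Hypothetical partitioned table, for illustration only.
spark.sql("CREATE TABLE events (value STRING, dt STRING) USING parquet PARTITIONED BY (dt)")

import spark.implicits._
val df = Seq(("a", "2016-06-17"), ("b", "2016-06-18")).toDF("value", "dt")

// Problematic: the extra partitionBy() can disagree with the partitioning
// recorded in the catalog and produce a wrong directory layout.
// df.write.partitionBy("dt").insertInto("events")

// The partition columns should be determined by the existing table's metadata only.
df.write.insertInto("events")
{code}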
[jira] [Commented] (SPARK-16028) Remove the need to pass in a SparkContext for spark.lapply
[ https://issues.apache.org/jira/browse/SPARK-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337475#comment-15337475 ] Felix Cheung commented on SPARK-16028: -- Fix ready as soon as the parent PR is merged. > Remove the need to pass in a SparkContext for spark.lapply > --- > > Key: SPARK-16028 > URL: https://issues.apache.org/jira/browse/SPARK-16028 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > Similar to https://github.com/apache/spark/pull/9192 and SPARK-10903 we > should remove the need to pass in SparkContext to `spark.lapply` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16027) Fix SparkR session unit test
[ https://issues.apache.org/jira/browse/SPARK-16027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337476#comment-15337476 ] Felix Cheung commented on SPARK-16027: -- Fix ready as soon as parent PR is merged. > Fix SparkR session unit test > > > Key: SPARK-16027 > URL: https://issues.apache.org/jira/browse/SPARK-16027 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Shivaram Venkataraman > > As described in https://github.com/apache/spark/pull/13635/files, the test > titled "repeatedly starting and stopping SparkR" does not seem to work > consistently with the new sparkR.session code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16039) Spark SQL - Number of rows inserted by Insert Sql
Prabhu Kasinathan created SPARK-16039: - Summary: Spark SQL - Number of rows inserted by Insert Sql Key: SPARK-16039 URL: https://issues.apache.org/jira/browse/SPARK-16039 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.1 Reporter: Prabhu Kasinathan An INSERT in Spark SQL currently returns only "OK" and the time taken. It would be good if an INSERT statement also returned the number of rows inserted into the target table. Example: {code} INSERT INTO TABLE target SELECT * FROM source; 1000 rows inserted OK Time taken: 1 min 30 secs {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
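Until something along these lines is implemented, a rough workaround is to count the rows being inserted explicitly. The sketch below is only illustrative (it reuses the {{source}}/{{target}} names from the example above) and pays the cost of evaluating the query an extra time.

{code}
// Count the source rows explicitly, then insert them; note the extra scan.
val toInsert = spark.sql("SELECT * FROM source")
val rowCount = toInsert.count()
toInsert.write.insertInto("target")
println(s"$rowCount rows inserted")
{code}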
[jira] [Comment Edited] (SPARK-15340) Limit the size of the map used to cache JobConfs to avoid OOM
[ https://issues.apache.org/jira/browse/SPARK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337421#comment-15337421 ] Zhongshuai Pei edited comment on SPARK-15340 at 6/18/16 1:47 AM: - [~clockfly] 1. I run in cluster mode on YARN and use beeline. 2. I run TPC-DS (500 GB scale, and it must be ORC) with driver.memory set to 30g. 3. It is a heap-space OOM: run "jstat -gc <pid>" and you will see the old generation grow quickly and never get released. 4. After running TPC-DS for about 5 hours the OOM happened. was (Author: doingdone9): [~clockfly] 1. I run in cluster mode on YARN and use spark-sql. 2. I run TPC-DS (500 GB scale, and it must be ORC) with driver.memory set to 30g. 3. It is a heap-space OOM: run "jstat -gc <pid>" and you will see the old generation grow quickly and never get released. 4. After running TPC-DS for about 5 hours the OOM happened. > Limit the size of the map used to cache JobConfs to avoid OOM > > > Key: SPARK-15340 > URL: https://issues.apache.org/jira/browse/SPARK-15340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Zhongshuai Pei >Priority: Critical > > When I run TPC-DS (ORC) by using JDBCServer, the driver always OOMs. > The dump file shows tens of thousands of JobConf objects that cannot be > recycled, so we should limit the size of the map used to cache JobConfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16016) where i can find the code of Extreme Learning Machine(elm) on spark
[ https://issues.apache.org/jira/browse/SPARK-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337423#comment-15337423 ] yueyou commented on SPARK-16016: you say nothing > where i can find the code of Extreme Learning Machine(elm) on spark > --- > > Key: SPARK-16016 > URL: https://issues.apache.org/jira/browse/SPARK-16016 > Project: Spark > Issue Type: IT Help > Components: MLlib >Affects Versions: 1.6.0 >Reporter: yueyou > Original Estimate: 72h > Remaining Estimate: 72h > > I can't find the code for Extreme Learning Machine (ELM) on Spark. Can > someone help me? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15340) Limit the size of the map used to cache JobConfs to avoid OOM
[ https://issues.apache.org/jira/browse/SPARK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337421#comment-15337421 ] Zhongshuai Pei commented on SPARK-15340: [~clockfly] 1. I run in cluster mode on YARN and use spark-sql. 2. I run TPC-DS (500 GB scale, and it must be ORC) with driver.memory set to 30g. 3. It is a heap-space OOM: run "jstat -gc <pid>" and you will see the old generation grow quickly and never get released. 4. After running TPC-DS for about 5 hours the OOM happened. > Limit the size of the map used to cache JobConfs to avoid OOM > > > Key: SPARK-15340 > URL: https://issues.apache.org/jira/browse/SPARK-15340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Zhongshuai Pei >Priority: Critical > > When I run TPC-DS (ORC) by using JDBCServer, the driver always OOMs. > The dump file shows tens of thousands of JobConf objects that cannot be > recycled, so we should limit the size of the map used to cache JobConfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
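One common way to bound such a cache, sketched here purely for illustration (this is not the actual HadoopRDD caching code), is an access-ordered {{LinkedHashMap}} that evicts its eldest entry once a size limit is reached:

{code}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// A size-bounded, access-ordered cache: once maxEntries is exceeded the
// least-recently-used entry is dropped, so the map can no longer grow
// without bound and hold on to thousands of JobConf-sized objects.
class BoundedCache[K, V](maxEntries: Int)
  extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {

  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxEntries
}
{code}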
[jira] [Commented] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337389#comment-15337389 ] Andrea Pasqua commented on SPARK-16035: --- https://github.com/apache/spark/pull/13750 > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Pasqua updated SPARK-16035: -- Comment: was deleted (was: https://github.com/apache/spark/pull/13750) > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16035: Assignee: (was: Apache Spark) > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337388#comment-15337388 ] Apache Spark commented on SPARK-16035: -- User 'andreapasqua' has created a pull request for this issue: https://github.com/apache/spark/pull/13750 > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16035: Assignee: Apache Spark > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Assignee: Apache Spark >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16034: Assignee: (was: Apache Spark) > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16034: Assignee: Apache Spark > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong >Assignee: Apache Spark > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337385#comment-15337385 ] Apache Spark commented on SPARK-16034: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/13749 > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16031) Add debug-only socket source in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16031: Assignee: Matei Zaharia (was: Apache Spark) > Add debug-only socket source in Structured Streaming > > > Key: SPARK-16031 > URL: https://issues.apache.org/jira/browse/SPARK-16031 > Project: Spark > Issue Type: New Feature > Components: SQL, Streaming >Reporter: Matei Zaharia >Assignee: Matei Zaharia > > This is a debug-only version of SPARK-15842: for tutorials and debugging of > streaming apps, it would be nice to have a text-based socket source similar > to the one in Spark Streaming. It will clearly be marked as debug-only so > that users don't try to run it in production applications, because this type > of source cannot provide HA without storing a lot of state in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
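Usage of such a source would presumably mirror the existing Structured Streaming sources. The sketch below assumes it is registered under the format name {{socket}} with {{host}}/{{port}} options; those names are an assumption for illustration, not something stated in the ticket.

{code}
// Debug-only usage sketch; format name and option keys are assumptions.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")
  .start()
{code}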
[jira] [Commented] (SPARK-16031) Add debug-only socket source in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337373#comment-15337373 ] Apache Spark commented on SPARK-16031: -- User 'mateiz' has created a pull request for this issue: https://github.com/apache/spark/pull/13748 > Add debug-only socket source in Structured Streaming > > > Key: SPARK-16031 > URL: https://issues.apache.org/jira/browse/SPARK-16031 > Project: Spark > Issue Type: New Feature > Components: SQL, Streaming >Reporter: Matei Zaharia >Assignee: Matei Zaharia > > This is a debug-only version of SPARK-15842: for tutorials and debugging of > streaming apps, it would be nice to have a text-based socket source similar > to the one in Spark Streaming. It will clearly be marked as debug-only so > that users don't try to run it in production applications, because this type > of source cannot provide HA without storing a lot of state in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16031) Add debug-only socket source in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16031: Assignee: Apache Spark (was: Matei Zaharia) > Add debug-only socket source in Structured Streaming > > > Key: SPARK-16031 > URL: https://issues.apache.org/jira/browse/SPARK-16031 > Project: Spark > Issue Type: New Feature > Components: SQL, Streaming >Reporter: Matei Zaharia >Assignee: Apache Spark > > This is a debug-only version of SPARK-15842: for tutorials and debugging of > streaming apps, it would be nice to have a text-based socket source similar > to the one in Spark Streaming. It will clearly be marked as debug-only so > that users don't try to run it in production applications, because this type > of source cannot provide HA without storing a lot of state in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16038) we can omit partition list when insert into hive table
Wenchen Fan created SPARK-16038: --- Summary: we can omit partition list when insert into hive table Key: SPARK-16038 URL: https://issues.apache.org/jira/browse/SPARK-16038 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
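As a rough illustration of what omitting the partition list means (the table below is hypothetical and the exact semantics are up to the ticket), the trailing columns of the SELECT clause would be treated as the partition values:

{code}
// Hypothetical Hive table: src(a INT, b INT) partitioned by (p INT).
// Today a dynamic-partition insert needs an explicit PARTITION clause:
spark.sql("INSERT INTO TABLE src PARTITION (p) SELECT 1, 2, 3")

// The proposal is to accept the same statement without that clause,
// treating the last SELECT column as the value of partition column p:
spark.sql("INSERT INTO TABLE src SELECT 1, 2, 3")
{code}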
[jira] [Created] (SPARK-16037) use by-position resolution when insert into hive table
Wenchen Fan created SPARK-16037: --- Summary: use by-position resolution when insert into hive table Key: SPARK-16037 URL: https://issues.apache.org/jira/browse/SPARK-16037 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b; The result is 1, 3, 2 for a Hive table, which is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
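A minimal repro sketch of the statement above, assuming a Hive table {{src(a INT, b INT, c INT)}}; the expected result is inferred from the ticket title:

{code}
// By-position resolution should ignore the aliases and store (1, 2, 3);
// resolving by the alias names instead yields (1, 3, 2), which is the bug.
spark.sql("INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b")
spark.sql("SELECT a, b, c FROM src").show()
{code}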
[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Pasqua updated SPARK-16035: -- Description: Running SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') will not raise an exception as expected, although it parses it as if there was an end parenthesis. This can be fixed by replacing if start == -1: raise ValueError("Tuple should end with ')'") with if end == -1: raise ValueError("Tuple should end with ')'") Please see posted PR was: Running ``` SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') ``` > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Pasqua updated SPARK-16035: -- Component/s: PySpark > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-16034: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-16032 > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
Wenchen Fan created SPARK-16036: --- Summary: better error message if the number of columns in SELECT clause doesn't match the table schema Key: SPARK-16036 URL: https://issues.apache.org/jira/browse/SPARK-16036 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
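For context (this reading of the example is an inference, not spelled out in the ticket): with {{b}} and {{c}} fixed by the static PARTITION clause, the SELECT list only needs to supply the remaining column {{a}}, so supplying three values is the mismatch that currently produces a confusing error.

{code}
// Hypothetical table: src(a INT, b INT, c INT) partitioned by (b, c).
// b and c are given statically, so a single-column SELECT is what should
// be accepted; the three-column form in the ticket should fail with a
// clear "column count does not match" message instead of an obscure one.
spark.sql("INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4")
{code}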
[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Pasqua updated SPARK-16035: -- Description: Running ``` SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') ``` > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > > Running > ``` > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
Andrea Pasqua created SPARK-16035: - Summary: The SparseVector parser fails checking for valid end parenthesis Key: SPARK-16035 URL: https://issues.apache.org/jira/browse/SPARK-16035 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1, 2.0.0 Reporter: Andrea Pasqua Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
Sean Zhong created SPARK-16034: -- Summary: Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable Key: SPARK-16034 URL: https://issues.apache.org/jira/browse/SPARK-16034 Project: Spark Issue Type: Bug Reporter: Sean Zhong Suppose we have defined a partitioned table: {code} CREATE TABLE src (a INT, b INT, c INT) USING PARQUET PARTITIONED BY (a, b); {code} We should check the partition columns when appending DataFrame data to existing table: {code} val df = Seq((1, 2, 3)).toDF("a", "b", "c") df.write.partitionBy("b", "a").mode("append").saveAsTable("src") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
[ https://issues.apache.org/jira/browse/SPARK-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16033: Assignee: Cheng Lian (was: Apache Spark) > DataFrameWriter.partitionBy() can't be used together with > DataFrameWriter.insertInto() > -- > > Key: SPARK-16033 > URL: https://issues.apache.org/jira/browse/SPARK-16033 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > When inserting into an existing partitioned table, partitioning columns > should always be determined by catalog metadata of the existing table to be > inserted. Extra {{partitionBy()}} calls don't make sense, and mess up > existing data because newly inserted data may have wrong partitioning > directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
[ https://issues.apache.org/jira/browse/SPARK-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16033: Assignee: Apache Spark (was: Cheng Lian) > DataFrameWriter.partitionBy() can't be used together with > DataFrameWriter.insertInto() > -- > > Key: SPARK-16033 > URL: https://issues.apache.org/jira/browse/SPARK-16033 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > When inserting into an existing partitioned table, partitioning columns > should always be determined by catalog metadata of the existing table to be > inserted. Extra {{partitionBy()}} calls don't make sense, and mess up > existing data because newly inserted data may have wrong partitioning > directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16030: Assignee: Apache Spark (was: Yin Huai) > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
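For readers unfamiliar with the Hive-style syntax being requested in SPARK-16030: a static partition spec fixes some partition column values directly in the INSERT statement. The table and column names below are hypothetical, for illustration only.

{code}
// Hypothetical data source table: logs(msg STRING, dt STRING) partitioned by dt.
// A static partition spec pins dt in the statement itself, so the SELECT
// clause only supplies the non-partition columns:
spark.sql("INSERT INTO TABLE logs PARTITION (dt = '2016-06-17') SELECT msg FROM staging")
{code}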
[jira] [Commented] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337341#comment-15337341 ] Apache Spark commented on SPARK-16030: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13746 > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16030: Assignee: Yin Huai (was: Apache Spark) > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
[ https://issues.apache.org/jira/browse/SPARK-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337343#comment-15337343 ] Apache Spark commented on SPARK-16033: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/13747 > DataFrameWriter.partitionBy() can't be used together with > DataFrameWriter.insertInto() > -- > > Key: SPARK-16033 > URL: https://issues.apache.org/jira/browse/SPARK-16033 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > When inserting into an existing partitioned table, partitioning columns > should always be determined by catalog metadata of the existing table to be > inserted. Extra {{partitionBy()}} calls don't make sense, and mess up > existing data because newly inserted data may have wrong partitioning > directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337335#comment-15337335 ] Gayathri Murali commented on SPARK-15997: - https://github.com/apache/spark/pull/13745 - This is right link to the PR > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Gayathri Murali > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
[ https://issues.apache.org/jira/browse/SPARK-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16033: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-16032 > DataFrameWriter.partitionBy() can't be used together with > DataFrameWriter.insertInto() > -- > > Key: SPARK-16033 > URL: https://issues.apache.org/jira/browse/SPARK-16033 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > When inserting into an existing partitioned table, partitioning columns > should always be determined by catalog metadata of the existing table to be > inserted. Extra {{partitionBy()}} calls don't make sense, and mess up > existing data because newly inserted data may have wrong partitioning > directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16033) DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto()
Cheng Lian created SPARK-16033: -- Summary: DataFrameWriter.partitionBy() can't be used together with DataFrameWriter.insertInto() Key: SPARK-16033 URL: https://issues.apache.org/jira/browse/SPARK-16033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra {{partitionBy()}} calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16030: --- Assignee: Yin Huai > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
Cheng Lian created SPARK-16032: -- Summary: Audit semantics of various insertion operations related to partitioned tables Key: SPARK-16032 URL: https://issues.apache.org/jira/browse/SPARK-16032 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Wenchen Fan Priority: Blocker We found that semantics of various insertion operations related to partition tables can be inconsistent. This is an umbrella ticket for all related tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16030: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-16032 > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15997: Assignee: Gayathri Murali (was: Apache Spark) > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Gayathri Murali > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337285#comment-15337285 ] Apache Spark commented on SPARK-15997: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/13176 > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Gayathri Murali > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15997: Assignee: Apache Spark (was: Gayathri Murali) > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Apache Spark > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16030) Allow specifying static partitions in an INSERT statement for data source tables
[ https://issues.apache.org/jira/browse/SPARK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16030: - Priority: Critical (was: Major) > Allow specifying static partitions in an INSERT statement for data source > tables > > > Key: SPARK-16030 > URL: https://issues.apache.org/jira/browse/SPARK-16030 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15916: --- Description: A table from SQL server Northwind database was registered as a JDBC dataframe. A query was executed on Spark SQL, the {{northwind_dbo_Categories}} table is a temporary table which is a JDBC dataframe to {{\[northwind\].\[dbo\].\[Categories\]}} SQL server table: SQL executed on Spark sql context: {code:sql} SELECT CategoryID FROM northwind_dbo_Categories WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' {code} Spark has done a proper predicate pushdown to JDBC, however the parentheses around the two {{OR}} conditions were removed. Instead the following query was sent over JDBC to SQL Server: {code:sql} SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' {code} As a result, the last two conditions (around the AND operator) were considered as the highest precedence: {{(CategoryID = 2) AND CategoryName = 'Beverages'}} Finally SQL Server has executed a query like this: {code:sql} SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 OR (CategoryID = 2 AND CategoryName = 'Beverages') {code} was: A table from sql server Northwind database was registered as a JDBC dataframe. A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" sql server table: SQL executed on Spark sql context: SELECT CategoryID FROM northwind_dbo_Categories WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' Spark has done a proper predicate pushdown to JDBC, however parenthesis around two OR conditions was removed. Instead the following query was sent over JDBC to SQL Server: SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' As a result, the last two conditions (around the AND operator) were considered as the highest precedence: (CategoryID = 2) AND CategoryName = 'Beverages' Finally SQL Server has executed a query like this: SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 OR (CategoryID = 2 AND CategoryName = 'Beverages') > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas >Assignee: Hyukjin Kwon > Fix For: 2.0.0 > > > A table from SQL server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the {{northwind_dbo_Categories}} table is > a temporary table which is a JDBC dataframe to > {{\[northwind\].\[dbo\].\[Categories\]}} SQL server table: > SQL executed on Spark sql context: > {code:sql} > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > {code} > Spark has done a proper predicate pushdown to JDBC, however the parentheses > around the two {{OR}} conditions were removed. Instead the following query was > sent over JDBC to SQL Server: > {code:sql} > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > {code} > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: {{(CategoryID = 2) AND CategoryName = > 'Beverages'}} > Finally SQL Server has executed a query like this: > {code:sql} > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
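The usual fix for this class of bug, sketched below with a simplified filter ADT (illustration only, not the actual Spark JDBC filter-compilation code), is to parenthesize both sides whenever an AND/OR node is turned back into SQL, so the original grouping survives:

{code}
// Simplified stand-ins for the data source filter tree; illustration only.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

def compileValue(value: Any): String = value match {
  case s: String => s"'$s'"
  case v => v.toString
}

// Wrapping each side in parentheses preserves the intended precedence
// regardless of how AND and OR are nested.
def compileFilter(f: Filter): String = f match {
  case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
  case And(l, r) => s"(${compileFilter(l)}) AND (${compileFilter(r)})"
  case Or(l, r)  => s"(${compileFilter(l)}) OR (${compileFilter(r)})"
}

// compileFilter(And(Or(EqualTo("CategoryID", 1), EqualTo("CategoryID", 2)),
//                   EqualTo("CategoryName", "Beverages")))
// => ((CategoryID = 1) OR (CategoryID = 2)) AND (CategoryName = 'Beverages')
{code}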
[jira] [Commented] (SPARK-15984) WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN
[ https://issues.apache.org/jira/browse/SPARK-15984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337278#comment-15337278 ] Saisai Shao commented on SPARK-15984: - Is there actually a problem here? I guess the max app attempts might have been set to 0, which is why you get this warning; 0 is not a legal value for max app attempts, it should be >= 1. Here is the YARN code. {code} int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); int individualMaxAppAttempts = submissionContext.getMaxAppAttempts(); if (individualMaxAppAttempts <= 0 || individualMaxAppAttempts > globalMaxAppAttempts) { this.maxAppAttempts = globalMaxAppAttempts; LOG.warn("The specific max attempts: " + individualMaxAppAttempts + " for application: " + applicationId.getId() + " is invalid, because it is out of the range [1, " + globalMaxAppAttempts + "]. Use the global max attempts instead."); } else { this.maxAppAttempts = individualMaxAppAttempts; } {code} > WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max > attempts: 0 for application: 8 is invalid" when starting application on YARN > -- > > Key: SPARK-15984 > URL: https://issues.apache.org/jira/browse/SPARK-15984 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > When executing {{spark-shell}} on Spark on YARN 2.7.2 on Mac OS as follows: > {code} > YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c > spark.shuffle.service.enabled=true --deploy-mode client -c > spark.scheduler.mode=FAIR > {code} > it ends up with the following WARN in the logs: > {code} > 2016-06-16 08:33:05,308 INFO > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new > applicationId: 8 > 2016-06-16 08:33:07,305 WARN > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific > max attempts: 0 for application: 8 is invalid, because it is out of the range > [1, 2]. Use the global max attempts instead. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
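If the warning is indeed caused by a max-attempts value of 0 being submitted, one way to keep the value inside the legal range is to set it explicitly. The sketch below uses the standard {{spark.yarn.maxAppAttempts}} setting; the value 1 is only an example, and whether this addresses the particular report above is an assumption.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Keep the submitted value within [1, yarn.resourcemanager.am.max-attempts]
// so the ResourceManager does not fall back to the global default and warn.
val conf = new SparkConf()
  .setAppName("yarn-max-attempts-example")
  .set("spark.yarn.maxAppAttempts", "1")
val sc = new SparkContext(conf)
{code}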