[jira] [Assigned] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21690: Assignee: Apache Spark (was: zhengruifeng) > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.1 > Reporter: zhengruifeng > Assignee: Apache Spark > >
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
>     .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if (filtered.take(1).length == 0) {
>     throw new SparkException(s"surrogate cannot be computed. " +
>       s"All the values in $inputCol are Null, NaN or missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
>     case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
>     case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current implementation of {{Imputer}} processes one column after another. We should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21690: Assignee: zhengruifeng (was: Apache Spark) > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.1 > Reporter: zhengruifeng > Assignee: zhengruifeng > >
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
>     .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if (filtered.take(1).length == 0) {
>     throw new SparkException(s"surrogate cannot be computed. " +
>       s"All the values in $inputCol are Null, NaN or missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
>     case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
>     case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current implementation of {{Imputer}} processes one column after another. We should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
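[Editor's note] A minimal sketch of the one-pass idea for the mean strategy, assuming the same {{dataset}} and {{missingValue}} as the snippet above and a plain {{inputCols: Array[String]}} (illustration only, not the committed fix): all surrogates come from a single aggregation job instead of one Spark job per column.
{code:scala}
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// One aggregate per column: when() without otherwise() yields null for
// NaN/sentinel values (and null inputs stay null), and avg() skips nulls,
// so a single select computes every column's surrogate in one data pass.
val aggExprs = inputCols.map { c =>
  val ic = col(c).cast(DoubleType)
  avg(when(!ic.isNaN && ic =!= missingValue, ic)).alias(c)
}
val row = dataset.select(aggExprs: _*).head()
val surrogates = inputCols.indices.map { i =>
  // row.isNullAt(i) would correspond to the "all values are null/NaN/missing"
  // error case handled by the SparkException above.
  row.getDouble(i)
}
{code}
The median strategy would similarly need a single multi-column {{approxQuantile}} call rather than one call per column.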
[jira] [Created] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id
Yann Byron created SPARK-21858: -- Summary: Make Spark grouping_id() compatible with Hive grouping__id Key: SPARK-21858 URL: https://issues.apache.org/jira/browse/SPARK-21858 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Yann Byron If you want to migrate ETLs that use `grouping__id` in Hive to Spark, using Spark's `grouping_id()` instead of Hive's `grouping__id`, you will find differences between their evaluations. Here is an example.
{code:java}
select A, B, grouping__id/grouping_id()
from t
group by A, B grouping sets((), (A), (B), (A,B))
{code}
Running it on Hive and Spark separately, you'll find the following (an attribute selected in the current grouping set is represented by (/), and an excluded one by (x)):
||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
|(x) (x)|11|3|0|00|(x) (x)|
|(x) (/)|10|2|2|10|(/) (x)|
|(/) (x)|01|1|1|01|(x) (/)|
|(/) (/)|00|0|3|11|(/) (/)|
As shown above, in Spark a selected attribute (/) maps to 0 and an excluded one (x) to 1, while in Hive it is the opposite. Moreover, the attributes in `group by` are reversed first in Hive, whereas Spark evaluates them in the order given. I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
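[Editor's note] Based purely on the table above, the two encodings differ by a bitwise complement plus a reversal of bit order. A hypothetical helper illustrating the mapping (derived only from the reported table; not part of any Spark API):
{code:scala}
// Convert Spark's grouping_id() value to Hive's grouping__id for a grouping
// over numCols GROUP BY columns: flip every bit ((x) <-> (/) convention)
// and reverse the bit order (Hive numbers the columns from the other end).
def sparkToHiveGroupingId(sparkId: Long, numCols: Int): Long =
  (0 until numCols).foldLeft(0L) { (hiveId, i) =>
    val flipped = ((sparkId >> i) & 1L) ^ 1L   // complement bit i
    hiveId | (flipped << (numCols - 1 - i))    // place it at the mirrored position
  }

// Checks against the table above: Spark 3,2,1,0 -> Hive 0,2,1,3 for two columns.
assert(Seq(3L, 2L, 1L, 0L).map(sparkToHiveGroupingId(_, 2)) == Seq(0L, 2L, 1L, 3L))
{code}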
[jira] [Created] (SPARK-21857) Exception in thread "main" java.lang.ExceptionInInitializerError
Nagamanoj created SPARK-21857: - Summary: Exception in thread "main" java.lang.ExceptionInInitializerError Key: SPARK-21857 URL: https://issues.apache.org/jira/browse/SPARK-21857 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: Nagamanoj After installing Spark from a prebuilt version, when we run ./bin/pyspark with Java version = Java 9, I'm getting the following exception:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/28 20:06:43 INFO SparkContext: Running Spark version 2.2.0
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
	at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
	at java.base/java.lang.String.substring(String.java:1885)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
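[Editor's note] The root cause is visible in the "Caused by": older Hadoop versions take substring(0, 3) of the java.version system property to compare version strings, which works for values like "1.8.0_144" but throws on Java 9, where the property is just "9". A hedged reconstruction of the failing check (the exact field name and line in that Hadoop release are assumptions inferred from the trace):
{code:scala}
// What Shell.java's static initializer effectively does. On Java 9 the
// version string has length 1, so substring(0, 3) throws
// StringIndexOutOfBoundsException: begin 0, end 3, length 1.
val javaVersion = System.getProperty("java.version") // "9" on Java 9, "1.8.0_144" on Java 8
val isJava7OrAbove = javaVersion.substring(0, 3).compareTo("1.7") >= 0
{code}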
[jira] [Updated] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21855: - Component/s: (was: Deploy) YARN > When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: YARN > Affects Versions: 2.2.0 > Reporter: zhoukang > Priority: Trivial > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
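[Editor's note] A hedged sketch of the behavior the log line implies and what the ticket asks to change (the bookkeeping names are invented, not quoted from yarn.Client):
{code:scala}
import scala.collection.mutable

// Hypothetical distributed-cache bookkeeping: a duplicate destination path
// is currently only warned about and then skipped.
val distCacheEntries = mutable.Set[String]()

def addResource(destPath: String): Unit = {
  if (!distCacheEntries.add(destPath)) {
    // Today: WARN and skip. Proposal in this ticket: log an error (or fail)
    // so duplicate uploads do not go unnoticed.
    println(s"WARN yarn.Client: Resource $destPath added multiple times to distributed cache.")
  }
}

addResource("hdfs://host/scripts/oom_script.sh")
addResource("hdfs://host/scripts/oom_script.sh") // triggers the warning
{code}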
[jira] [Commented] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144719#comment-16144719 ] Ming Jiang commented on SPARK-21856: I can work on it, thanks! > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark > Affects Versions: 2.3.0 > Reporter: Weichen Xu > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the Python API also needs updating. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
Weichen Xu created SPARK-21856: -- Summary: Update Python API for MultilayerPerceptronClassifierModel Key: SPARK-21856 URL: https://issues.apache.org/jira/browse/SPARK-21856 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Weichen Xu SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the Python API also needs updating. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
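[Editor's note] For reference, a usage sketch of the Scala-side behavior the Python wrapper would mirror (the layer sizes and the {{train}}/{{test}} DataFrames are made up; the setter shapes are assumed from the post-SPARK-12664 Scala API):
{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// After SPARK-12664, the fitted model can emit a probability column on the
// Scala side; the Python API should expose the same column and params.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))          // input, hidden, output sizes (assumed)
  .setProbabilityCol("probability")
val model = mlp.fit(train)            // `train` is an assumed DataFrame
model.transform(test).select("probability", "prediction").show()
{code}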
[jira] [Assigned] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21855: Assignee: Apache Spark > When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.2.0 > Reporter: zhoukang > Assignee: Apache Spark > Priority: Trivial > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21855: Assignee: (was: Apache Spark) > When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.2.0 > Reporter: zhoukang > Priority: Trivial > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144710#comment-16144710 ] Apache Spark commented on SPARK-21855: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/19073 > When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.2.0 > Reporter: zhoukang > Priority: Trivial > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21855: - Description: Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
{code:java}
17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
{code}
> When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.2.0 > Reporter: zhoukang > Priority: Minor > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
[ https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21855: - Priority: Trivial (was: Minor) > When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning > -- > > Key: SPARK-21855 > URL: https://issues.apache.org/jira/browse/SPARK-21855 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.2.0 > Reporter: zhoukang > Priority: Trivial > > Currently, when we submit a job to YARN and upload the same file multiple times, we throw an exception but the logging level is only WARN.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added multiple times to distributed cache.
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21855) When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning
zhoukang created SPARK-21855: Summary: When submitting a job to YARN and adding a file multiple times, we should log an error instead of a warning Key: SPARK-21855 URL: https://issues.apache.org/jira/browse/SPARK-21855 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.0 Reporter: zhoukang Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe
[ https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shailesh Kini updated SPARK-21853: -- Issue Type: Bug (was: Question) > Getting an exception while calling the except method on the dataframe > - > > Key: SPARK-21853 > URL: https://issues.apache.org/jira/browse/SPARK-21853 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 2.1.1 > Reporter: Shailesh Kini > Attachments: SparkException.txt > > > I am getting an exception while calling except on the Dataset: > org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from > I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to create DS3. DS3 has some rows which are similar with the exception of one column. I need to isolate those rows and remove the similar rows. I use groupBy with count > 1 on a few columns in DS3 to get those similar rows - dataset DS4. DS4 has only a few columns, not all of them, so I join it back with DS3 on the aggregate columns to get a new dataset DS5 which has the same columns as DS3. To get a clean dataset without any of those similar rows, I call DS3.except(DS5), which throws the exception. The attribute is one of the filtering criteria I use when creating DS1. > Attaching the exception to this ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21853) Getting an exception while calling the except method on the dataframe
[ https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144697#comment-16144697 ] Shailesh Kini commented on SPARK-21853: --- As a workaround, I saved the dataset DS3 in Parquet format and read it back, after which I was able to call except successfully. > Getting an exception while calling the except method on the dataframe > - > > Key: SPARK-21853 > URL: https://issues.apache.org/jira/browse/SPARK-21853 > Project: Spark > Issue Type: Question > Components: Spark Shell > Affects Versions: 2.1.1 > Reporter: Shailesh Kini > Attachments: SparkException.txt > > > I am getting an exception while calling except on the Dataset: > org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from > I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to create DS3. DS3 has some rows which are similar with the exception of one column. I need to isolate those rows and remove the similar rows. I use groupBy with count > 1 on a few columns in DS3 to get those similar rows - dataset DS4. DS4 has only a few columns, not all of them, so I join it back with DS3 on the aggregate columns to get a new dataset DS5 which has the same columns as DS3. To get a clean dataset without any of those similar rows, I call DS3.except(DS5), which throws the exception. The attribute is one of the filtering criteria I use when creating DS1. > Attaching the exception to this ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
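[Editor's note] A hedged sketch of the workaround described in this comment (the path and the grouping columns are invented for illustration): materializing DS3 to Parquet truncates the problematic lineage, so the re-read DataFrame and everything derived from it carry fresh, non-conflicting attribute IDs.
{code:scala}
// Materialize DS3 and read it back, then rebuild DS4/DS5 from the re-read
// frame; except() no longer hits "resolved attribute(s) ... missing".
DS3.write.mode("overwrite").parquet("/tmp/ds3_checkpoint")
val ds3 = spark.read.parquet("/tmp/ds3_checkpoint")
val ds4 = ds3.groupBy("colA", "colB").count().filter("count > 1") // columns assumed
val ds5 = ds3.join(ds4.drop("count"), Seq("colA", "colB"))
val cleaned = ds3.except(ds5)
{code}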
[jira] [Assigned] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17133: Assignee: Apache Spark > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Apache Spark > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144695#comment-16144695 ] Apache Spark commented on SPARK-17133: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19072 > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17133: Assignee: (was: Apache Spark) > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21839: Assignee: Apache Spark > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21839: Assignee: (was: Apache Spark) > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
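[Editor's note] A usage sketch of the proposed option, by analogy with the existing Parquet one (the option name comes from the issue text; the codec value and path are arbitrary):
{code:scala}
// Once implemented, this session config would control the ORC codec the
// same way spark.sql.parquet.compression.codec does for Parquet.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")
spark.range(100).write.orc("/tmp/orc_out") // illustrative path
{code}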
[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe
[ https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shailesh Kini updated SPARK-21853: -- Description: I am getting an exception while calling except on the Dataset: org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to create DS3. DS3 has some rows which are similar with the exception of one column. I need to isolate those rows and remove the similar rows. I use groupBy with count > 1 on a few columns in DS3 to get those similar rows - dataset DS4. DS4 has only a few columns, not all of them, so I join it back with DS3 on the aggregate columns to get a new dataset DS5 which has the same columns as DS3. To get a clean dataset without any of those similar rows, I call DS3.except(DS5), which throws the exception. The attribute is one of the filtering criteria I use when creating DS1. Attaching the exception to this ticket. was: I am getting an exception while calling except on the Dataset: org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from I have 2 csv files. I create two datasets, DS1 and DS2, which I join to create DS3. I need to filter out duplicates for further processing. I aggregate the DS3 dataset on some columns and filter where the count > 1. This is DS4. I now join DS3 with DS4 on those columns and get DS5. DS5 has the same structure as DS3, as I drop the columns from the join. DS5 now has all the rows which are duplicated. I then call except on DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to filter out and remove one of the duplicates (not all the columns are duplicated, so I need to use filter) and union it with DS6 to get the dataset free of duplicates. Attaching the exception to this ticket. > Getting an exception while calling the except method on the dataframe > - > > Key: SPARK-21853 > URL: https://issues.apache.org/jira/browse/SPARK-21853 > Project: Spark > Issue Type: Question > Components: Spark Shell > Affects Versions: 2.1.1 > Reporter: Shailesh Kini > Attachments: SparkException.txt > > > I am getting an exception while calling except on the Dataset: > org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from > I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to create DS3. DS3 has some rows which are similar with the exception of one column. I need to isolate those rows and remove the similar rows. I use groupBy with count > 1 on a few columns in DS3 to get those similar rows - dataset DS4. DS4 has only a few columns, not all of them, so I join it back with DS3 on the aggregate columns to get a new dataset DS5 which has the same columns as DS3. To get a clean dataset without any of those similar rows, I call DS3.except(DS5), which throws the exception. The attribute is one of the filtering criteria I use when creating DS1. > Attaching the exception to this ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21854) Python interface for MLOR summary
Weichen Xu created SPARK-21854: -- Summary: Python interface for MLOR summary Key: SPARK-21854 URL: https://issues.apache.org/jira/browse/SPARK-21854 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Weichen Xu Python interface for MLOR summary -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21834: Assignee: Apache Spark > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 2.2.0 > Reporter: Sital Kedia > Assignee: Apache Spark > > The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21834: Assignee: (was: Apache Spark) > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 2.2.0 > Reporter: Sital Kedia > > The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
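[Editor's note] One possible shape of the fix, sketched under assumptions (the parameter name is illustrative; check the actual PR for the final API): let callers say whether killing should also lower the executor target.
{code:scala}
// Hypothetical signature sketch: the dynamic-allocation manager would pass
// adjustTargetNumExecutors = false because it maintains the target itself,
// while user-initiated kills keep today's behavior of shrinking the target.
def killExecutors(
    executorIds: Seq[String],
    adjustTargetNumExecutors: Boolean): Seq[String] = {
  if (adjustTargetNumExecutors) {
    // shrink the requested total before killing (current behavior)
  }
  // ... send the kill requests; return the ids actually killed
  executorIds
}
{code}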
[jira] [Assigned] (SPARK-21801) SparkR unit tests randomly fail on trees
[ https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21801: Assignee: (was: Apache Spark) > SparkR unit tests randomly fail on trees > --- > > Key: SPARK-21801 > URL: https://issues.apache.org/jira/browse/SPARK-21801 > Project: Spark > Issue Type: Bug > Components: SparkR, Tests > Affects Versions: 2.2.0 > Reporter: Weichen Xu > Priority: Critical > > SparkR unit tests will sometimes randomly fail with an error such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) ------------------------------
> java.lang.IllegalArgumentException: requirement failed: The input column stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) ------------------------------
> java.lang.IllegalArgumentException: requirement failed: The input column stridx_d6a0b492cfa1 should have at least two distinct values.
> ```
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21801) SparkR unit tests randomly fail on trees
[ https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21801: Assignee: Apache Spark > SparkR unit tests randomly fail on trees > --- > > Key: SPARK-21801 > URL: https://issues.apache.org/jira/browse/SPARK-21801 > Project: Spark > Issue Type: Bug > Components: SparkR, Tests > Affects Versions: 2.2.0 > Reporter: Weichen Xu > Assignee: Apache Spark > Priority: Critical > > SparkR unit tests will sometimes randomly fail with an error such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) ------------------------------
> java.lang.IllegalArgumentException: requirement failed: The input column stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) ------------------------------
> java.lang.IllegalArgumentException: requirement failed: The input column stridx_d6a0b492cfa1 should have at least two distinct values.
> ```
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21852) Empty Parquet Files created as a result of spark jobs fail when read
[ https://issues.apache.org/jira/browse/SPARK-21852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144650#comment-16144650 ] Hyukjin Kwon commented on SPARK-21852: -- I generally agree with Sean and am quite sure this is not an issue. However, I want to make sure before resolving it, as I have seen at least a few corner cases so far. BTW, I'd close the Parquet JIRA you opened; this does not look like a Parquet issue. I will resolve this one if no more details can be provided. > Empty Parquet Files created as a result of spark jobs fail when read > > > Key: SPARK-21852 > URL: https://issues.apache.org/jira/browse/SPARK-21852 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.2.0 > Reporter: Shivam Dalmia > > I have faced an issue intermittently with certain spark jobs writing parquet files which apparently succeed, but the written .parquet directory in HDFS is an empty directory (with no _SUCCESS and _metadata parts, even). Surprisingly, no errors are thrown from the spark dataframe writer. > However, when attempting to read this written file, spark throws the error: > {{Unable to infer schema for Parquet. It must be specified manually}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe
[ https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shailesh Kini updated SPARK-21853: -- Attachment: SparkException.txt > Getting an exception while calling the except method on the dataframe > - > > Key: SPARK-21853 > URL: https://issues.apache.org/jira/browse/SPARK-21853 > Project: Spark > Issue Type: Question > Components: Spark Shell > Affects Versions: 2.1.1 > Reporter: Shailesh Kini > Attachments: SparkException.txt > > > I am getting an exception while calling except on the Dataset: > org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from > I have 2 csv files. I create two datasets, DS1 and DS2, which I join to create DS3. I need to filter out duplicates for further processing. I aggregate the DS3 dataset on some columns and filter where the count > 1. This is DS4. I now join DS3 with DS4 on those columns and get DS5. DS5 has the same structure as DS3, as I drop the columns from the join. DS5 now has all the rows which are duplicated. I then call except on DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to filter out and remove one of the duplicates (not all the columns are duplicated, so I need to use filter) and union it with DS6 to get the dataset free of duplicates. > Attaching the exception to this ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21853) Getting an exception while calling the except method on the dataframe
Shailesh Kini created SPARK-21853: - Summary: Getting an exception while calling the except method on the dataframe Key: SPARK-21853 URL: https://issues.apache.org/jira/browse/SPARK-21853 Project: Spark Issue Type: Question Components: Spark Shell Affects Versions: 2.1.1 Reporter: Shailesh Kini I am getting an exception while calling except on the Dataset: org.apache.spark.sql.AnalysisException: resolved attribute(s) SVC_BILLING_PERIOD#37723 missing from I have 2 csv files. I create two datasets, DS1 and DS2, which I join to create DS3. I need to filter out duplicates for further processing. I aggregate the DS3 dataset on some columns and filter where the count > 1. This is DS4. I now join DS3 with DS4 on those columns and get DS5. DS5 has the same structure as DS3, as I drop the columns from the join. DS5 now has all the rows which are duplicated. I then call except on DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to filter out and remove one of the duplicates (not all the columns are duplicated, so I need to use filter) and union it with DS6 to get the dataset free of duplicates. Attaching the exception to this ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15689: Labels: SPIP releasenotes (was: releasenotes) > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: SPIP, releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
[ https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21835: Assignee: Apache Spark > RewritePredicateSubquery should not produce unresolved query plans > -- > > Key: SPARK-21835 > URL: https://issues.apache.org/jira/browse/SPARK-21835 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Liang-Chi Hsieh > Assignee: Apache Spark > > {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. During the structural integrity check, I found that {{RewritePredicateSubquery}} can produce unresolved query plans due to conflicting attributes. We should not let {{RewritePredicateSubquery}} produce unresolved plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
[ https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21835: Assignee: (was: Apache Spark) > RewritePredicateSubquery should not produce unresolved query plans > -- > > Key: SPARK-21835 > URL: https://issues.apache.org/jira/browse/SPARK-21835 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Liang-Chi Hsieh > > {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. During the structural integrity check, I found that {{RewritePredicateSubquery}} can produce unresolved query plans due to conflicting attributes. We should not let {{RewritePredicateSubquery}} produce unresolved plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
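[Editor's note] For readers unfamiliar with the rule, a rough illustration of what {{RewritePredicateSubquery}} does (table names invented; the rewritten form is approximate, not the optimizer's literal output):
{code:scala}
// The optimizer rewrites predicate subqueries into joins. Roughly:
val before = spark.sql("SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)")
// becomes the equivalent of:
val after = spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id")
// If the two sides expose attributes with the same exprId (e.g. in
// self-join shapes), the generated join condition can reference conflicting
// attributes and leave the plan unresolved -- the bug this ticket targets.
{code}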
[jira] [Assigned] (SPARK-21568) ConsoleProgressBar should only be enabled in shells
[ https://issues.apache.org/jira/browse/SPARK-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21568: Assignee: (was: Apache Spark) > ConsoleProgressBar should only be enabled in shells > --- > > Key: SPARK-21568 > URL: https://issues.apache.org/jira/browse/SPARK-21568 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.2.0 > Reporter: Marcelo Vanzin > Priority: Minor > > This is the current logic that enables the progress bar:
> {code}
> _progressBar =
>   if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
>     Some(new ConsoleProgressBar(this))
>   } else {
>     None
>   }
> {code}
> That is based on the logging level; it just happens to align with the default configuration for shells (WARN) and normal apps (INFO). > But if someone changes the default logging config for their app, this may break; they may silence logs by setting the default level to WARN or ERROR, and a normal application will see a lot of log spam from the progress bar (which is especially bad when output is redirected to a file, as is usually done when running in cluster mode). > While it's possible to disable the progress bar separately, this behavior is not really expected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21568) ConsoleProgressBar should only be enabled in shells
[ https://issues.apache.org/jira/browse/SPARK-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21568: Assignee: Apache Spark > ConsoleProgressBar should only be enabled in shells > --- > > Key: SPARK-21568 > URL: https://issues.apache.org/jira/browse/SPARK-21568 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.2.0 > Reporter: Marcelo Vanzin > Assignee: Apache Spark > Priority: Minor > > This is the current logic that enables the progress bar:
> {code}
> _progressBar =
>   if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
>     Some(new ConsoleProgressBar(this))
>   } else {
>     None
>   }
> {code}
> That is based on the logging level; it just happens to align with the default configuration for shells (WARN) and normal apps (INFO). > But if someone changes the default logging config for their app, this may break; they may silence logs by setting the default level to WARN or ERROR, and a normal application will see a lot of log spam from the progress bar (which is especially bad when output is redirected to a file, as is usually done when running in cluster mode). > While it's possible to disable the progress bar separately, this behavior is not really expected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
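[Editor's note] One hedged sketch of the direction the summary suggests (the shell-detection signal here is an assumption, not the merged change): tie the default to being in a shell rather than to the log level.
{code:scala}
// Sketch: decide the progress bar from an explicit "running in a shell"
// signal (however that is plumbed through) instead of !log.isInfoEnabled,
// so user logging configuration no longer changes UI behavior.
val isShell: Boolean = sys.props.contains("spark.repl.class.outputDir") // assumed heuristic
_progressBar =
  if (_conf.getBoolean("spark.ui.showConsoleProgress", defaultValue = isShell)) {
    Some(new ConsoleProgressBar(this))
  } else {
    None
  }
{code}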
[jira] [Commented] (SPARK-21841) Spark SQL doesn't pick up column added in hive when table created with saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-21841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144540#comment-16144540 ] Marcelo Vanzin commented on SPARK-21841: "DataSource tables" (those created, in certain cases, with {{saveAsTable}}) have pretty spotty Hive compatibility. I've run into this in a recent PR (SPARK-21617) and [~smilegator] suggested having an explicit config added to ensure compatibility, although I don't think anyone is working on that. The workaround you have (using DDL SQL commands instead of doing it via Scala code) is what we have been suggesting to people for a really long time now. I haven't looked closely at the spec to see whether it covers this, but maybe this could be called out explicitly in SPARK-15689, which plans to update the DataSource APIs. > Spark SQL doesn't pick up column added in hive when table created with saveAsTable > -- > > Key: SPARK-21841 > URL: https://issues.apache.org/jira/browse/SPARK-21841 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0, 2.2.0 > Reporter: Thomas Graves > > If you create a table in Spark SQL but then modify the table in Hive to add a column, Spark SQL doesn't pick up the new column. > Basic example:
> {code}
> t1 = spark.sql("select ip_address from mydb.test_table limit 1")
> t1.show()
> +----------+
> |ip_address|
> +----------+
> | 1.30.25.5|
> +----------+
>
> t1.write.saveAsTable('mydb.t1')
>
> In Hive:
> alter table mydb.t1 add columns (bcookie string)
>
> t1 = spark.table("mydb.t1")
> t1.show()
> +----------+
> |ip_address|
> +----------+
> | 1.30.25.5|
> +----------+
> {code}
> It looks like it's because Spark SQL is picking up the schema from spark.sql.sources.schema.part.0 rather than from Hive. > Interestingly enough, it appears that if you create the table differently, like: > spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1") > run your alter table on mydb.t1, and then > val t1 = spark.table("mydb.t1") > it works properly. > It looks like the difference is that when it doesn't work, spark.sql.sources.provider=parquet is set. > It's doing this from createDataSourceTable, where the provider is parquet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
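[Editor's note] The suggested workaround in code form, reusing the commands from the report itself (sketch; the two paths are alternatives, not meant to run together):
{code:scala}
// DataSource path: Hive sees a frozen copy of the schema in table properties
// (spark.sql.sources.schema.part.*), so a Hive-side ALTER TABLE is ignored.
t1.write.saveAsTable("mydb.t1")

// DDL path (the workaround): the Hive metastore stays the source of truth,
// so columns added in Hive are later picked up by spark.table("mydb.t1").
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
{code}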
[jira] [Assigned] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns
[ https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21729: Assignee: (was: Apache Spark) > Generic test for ProbabilisticClassifier to ensure consistent output columns > > > Key: SPARK-21729 > URL: https://issues.apache.org/jira/browse/SPARK-21729 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > One challenge with the ProbabilisticClassifier abstraction is that it > introduces different code paths for predictions depending on which output > columns are turned on or off: probability, rawPrediction, prediction. We ran > into a bug in MLOR with this. > This task is for adding a generic test usable in all test suites for > ProbabilisticClassifier types which does the following: > * Take a dataset + Estimator > * Fit the Estimator > * Test prediction using the model with all combinations of output columns > turned on/off. > * Make sure the output column values match, presumably by comparing vs. the > case with all 3 output columns turned on > CC [~WeichenXu123] since this came up in > https://github.com/apache/spark/pull/17373 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns
[ https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21729: Assignee: Apache Spark > Generic test for ProbabilisticClassifier to ensure consistent output columns > > > Key: SPARK-21729 > URL: https://issues.apache.org/jira/browse/SPARK-21729 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > One challenge with the ProbabilisticClassifier abstraction is that it > introduces different code paths for predictions depending on which output > columns are turned on or off: probability, rawPrediction, prediction. We ran > into a bug in MLOR with this. > This task is for adding a generic test usable in all test suites for > ProbabilisticClassifier types which does the following: > * Take a dataset + Estimator > * Fit the Estimator > * Test prediction using the model with all combinations of output columns > turned on/off. > * Make sure the output column values match, presumably by comparing vs. the > case with all 3 output columns turned on > CC [~WeichenXu123] since this came up in > https://github.com/apache/spark/pull/17373 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
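[Editor's note] A hedged sketch of the generic check described above, specialized to LogisticRegression for brevity (a real version would be parameterized over any ProbabilisticClassifier; in Spark ML, setting an output column name to "" disables that column):
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

// Fit once, then verify the prediction column is identical no matter which
// combination of probability/rawPrediction output columns is enabled.
def checkConsistentPredictions(df: DataFrame): Unit = {
  val model = new LogisticRegression().fit(df)
  val expected = model.transform(df).select("prediction").collect().toSeq
  for {
    probCol <- Seq("probability", "") // "" turns the column off
    rawCol  <- Seq("rawPrediction", "")
  } {
    model.setProbabilityCol(probCol).setRawPredictionCol(rawCol)
    val actual = model.transform(df).select("prediction").collect().toSeq
    assert(actual == expected, s"Mismatch with probCol='$probCol', rawCol='$rawCol'")
  }
}
{code}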
[jira] [Assigned] (SPARK-20990) Multi-line support for JSON
[ https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20990: Assignee: (was: Apache Spark) > Multi-line support for JSON > --- > > Key: SPARK-20990 > URL: https://issues.apache.org/jira/browse/SPARK-20990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > When `multiLine` option is on, the existing JSON parser only reads the first > record. We should read the other records in the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20990) Multi-line support for JSON
[ https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20990: Assignee: Apache Spark > Multi-line support for JSON > --- > > Key: SPARK-20990 > URL: https://issues.apache.org/jira/browse/SPARK-20990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > When `multiLine` option is on, the existing JSON parser only reads the first > record. We should read the other records in the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
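[Editor's note] A usage sketch for the option in question (path and file contents assumed): with {{multiLine}} enabled, every record in the file should come back, not just the first.
{code:scala}
// Expectation once fixed: a multi-line JSON file holding several records
// (e.g. a JSON array) yields all of them.
val df = spark.read.option("multiLine", true).json("/tmp/records.json")
assert(df.count() > 1) // fails under the bug above when the file holds multiple records
{code}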
[jira] [Resolved] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21851. --- Resolution: Duplicate > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.0 > Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data, as shown here:
> {code}
> import pyspark.sql.functions as F
>
> num_rows = 200
> for num_cols in range(198, 205):
>     # create data frame with id and some dummy cols
>     df1 = spark.range(num_rows, numPartitions=100)
>     for i in range(num_cols - 1):
>         df1 = df1.withColumn("a" + str(i), F.lit("a"))
>     # create data frame with id to join
>     df2 = spark.range(num_rows, numPartitions=100)
>     # write and read to start "fresh"
>     df1.write.parquet("delme_1.parquet", mode="overwrite")
>     df2.write.parquet("delme_2.parquet", mode="overwrite")
>     df1 = spark.read.parquet("delme_1.parquet")
>     df2 = spark.read.parquet("delme_2.parquet")
>     df3 = df1.join(df2, "id", how="left").cache()  # this cache seems to make a difference
>     df4 = df3.filter("id<10")
>     print(len(df4.columns), df4.count(), df4.cache().count())  # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally (in fact more often) the middle number is also 10, the expected result. The last column may show different values, but 12 and 16 are common. Sometimes you have to try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node YARN cluster. > I am happy to provide more information if you let me know what is interesting. > It's not strictly `cache` that is the problem, since `toPandas` and `collect` show the same behavior, so I basically can hardly get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions
[ https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21848: Assignee: Apache Spark > Create trait to identify user-defined functions > --- > > Key: SPARK-21848 > URL: https://issues.apache.org/jira/browse/SPARK-21848 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 2.2.0 > Reporter: Gengliang Wang > Assignee: Apache Spark > Priority: Minor > > Create a trait to make it easier to identify which expressions are user-defined functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions
[ https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21848: Assignee: (was: Apache Spark) > Create trait to identify user-defined functions > --- > > Key: SPARK-21848 > URL: https://issues.apache.org/jira/browse/SPARK-21848 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 2.2.0 > Reporter: Gengliang Wang > Priority: Minor > > Create a trait to make it easier to identify which expressions are user-defined functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
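[Editor's note] A hypothetical shape of what this task asks for (the trait name follows the summary; the mix-in sites are assumptions):
{code:scala}
// Marker trait with no members: expressions backed by user code mix it in,
// so analyzer/optimizer rules can identify UDFs with a single type check.
trait UserDefinedExpression

// Assumed usage, e.g.:
//   case class ScalaUDF(...) extends Expression with UserDefinedExpression
//   case class ScalaUDAF(...) extends ImperativeAggregate with UserDefinedExpression
{code}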
[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144403#comment-16144403 ] Dongjoon Hyun commented on SPARK-21851: --- For the 1.6.2 issue, I think all vendors have already delivered this HOTFIX to their customers. Please ask your support team. :) > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.0 > Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data, as shown here:
> {code}
> import pyspark.sql.functions as F
>
> num_rows = 200
> for num_cols in range(198, 205):
>     # create data frame with id and some dummy cols
>     df1 = spark.range(num_rows, numPartitions=100)
>     for i in range(num_cols - 1):
>         df1 = df1.withColumn("a" + str(i), F.lit("a"))
>     # create data frame with id to join
>     df2 = spark.range(num_rows, numPartitions=100)
>     # write and read to start "fresh"
>     df1.write.parquet("delme_1.parquet", mode="overwrite")
>     df2.write.parquet("delme_2.parquet", mode="overwrite")
>     df1 = spark.read.parquet("delme_1.parquet")
>     df2 = spark.read.parquet("delme_2.parquet")
>     df3 = df1.join(df2, "id", how="left").cache()  # this cache seems to make a difference
>     df4 = df3.filter("id<10")
>     print(len(df4.columns), df4.count(), df4.cache().count())  # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally (in fact more often) the middle number is also 10, the expected result. The last column may show different values, but 12 and 16 are common. Sometimes you have to try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node YARN cluster. > I am happy to provide more information if you let me know what is interesting. > It's not strictly `cache` that is the problem, since `toPandas` and `collect` show the same behavior, so I basically can hardly get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144394#comment-16144394 ] Anton Suchaneck commented on SPARK-21851: - Not quite production, but still relevant work. Thanks for pointing it out. And I sure learned a lesson to watch the JIRAs of x.0.0 versions ;) Actually, judging by Hortonworks 2.5 and the fact that Spark 1.6.2 is affected, you are screwed either way, even if you use the old Spark :-o > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data as shown here: > {code} > import pyspark.sql.functions as F > num_rows=200 > for num_cols in range(198, 205): > # create data frame with id and some dummy cols > df1=spark.range(num_rows, numPartitions=100) > for i in range(num_cols-1): > df1=df1.withColumn("a"+str(i), F.lit("a")) > # create data frame with id to join > df2=spark.range(num_rows, numPartitions=100) > # write and read to start "fresh" > df1.write.parquet("delme_1.parquet", mode="overwrite") > df2.write.parquet("delme_2.parquet", mode="overwrite") > df1=spark.read.parquet("delme_1.parquet"); > df2=spark.read.parquet("delme_2.parquet"); > df3=df1.join(df2, "id", how="left").cache() # this cache seems to make > a difference > df4=df3.filter("id<10") > print(len(df4.columns), df4.count(), df4.cache().count()) # second > cache gives different result > {code} > Output: > {noformat} > 198 10 10 > 199 10 10 > 200 10 10 > 201 12 12 > 202 12 12 > 203 16 16 > 204 10 12 > {noformat} > Occasionally the middle number is also 10 (the expected result) more often. Last > column may show different values, but 12 and 16 are common. Sometimes you can > try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node > YARN cluster. > I am happy to provide more information if you let me know what is > interesting. > It's not strictly `cache` that is the problem, since `toPandas` and > `collect` show the same behavior, and I basically cannot get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144381#comment-16144381 ] Dongjoon Hyun commented on SPARK-21851: --- Unfortunately, there is no software without bugs. BTW, if your cluster is using Hortonworks, you know that Spark 2.0.0 is a technical preview due to exactly this kind of issue. You are not using it in production, are you? > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data as shown here: > {code} > import pyspark.sql.functions as F > num_rows=200 > for num_cols in range(198, 205): > # create data frame with id and some dummy cols > df1=spark.range(num_rows, numPartitions=100) > for i in range(num_cols-1): > df1=df1.withColumn("a"+str(i), F.lit("a")) > # create data frame with id to join > df2=spark.range(num_rows, numPartitions=100) > # write and read to start "fresh" > df1.write.parquet("delme_1.parquet", mode="overwrite") > df2.write.parquet("delme_2.parquet", mode="overwrite") > df1=spark.read.parquet("delme_1.parquet"); > df2=spark.read.parquet("delme_2.parquet"); > df3=df1.join(df2, "id", how="left").cache() # this cache seems to make > a difference > df4=df3.filter("id<10") > print(len(df4.columns), df4.count(), df4.cache().count()) # second > cache gives different result > {code} > Output: > {noformat} > 198 10 10 > 199 10 10 > 200 10 10 > 201 12 12 > 202 12 12 > 203 16 16 > 204 10 12 > {noformat} > Occasionally the middle number is also 10 (the expected result) more often. Last > column may show different values, but 12 and 16 are common. Sometimes you can > try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node > YARN cluster. > I am happy to provide more information if you let me know what is > interesting. > It's not strictly `cache` that is the problem, since `toPandas` and > `collect` show the same behavior, and I basically cannot get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21765: Assignee: (was: Apache Spark) > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > > Key: SPARK-21765 > URL: https://issues.apache.org/jira/browse/SPARK-21765 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > Fix For: 3.0.0 > > > LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some > streaming sources don't set the bit, and the bit can sometimes be lost in > rewriting. Setting the bit for all plans that are logically streaming will > help us simplify the logic around checking query plan validity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21765: Assignee: Apache Spark > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > > Key: SPARK-21765 > URL: https://issues.apache.org/jira/browse/SPARK-21765 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Assignee: Apache Spark > Fix For: 3.0.0 > > > LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some > streaming sources don't set the bit, and the bit can sometimes be lost in > rewriting. Setting the bit for all plans that are logically streaming will > help us simplify the logic around checking query plan validity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
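A sketch of the invariant being proposed, with simplified names (not the actual change): leaf nodes backed by streaming sources override isStreaming, and non-leaf plans already report the bit whenever any child does, so setting it correctly at the leaves propagates it through the whole plan.
{code}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Hypothetical leaf node backed by a streaming source. With
// isStreaming = true here, query-validity checks can rely on the
// bit alone instead of special-casing each source.
case class StreamingSourceRelation(output: Seq[Attribute]) extends LeafNode {
  override def isStreaming: Boolean = true
}
{code}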
[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144376#comment-16144376 ] Anton Suchaneck commented on SPARK-21851: - I wish upgrading were that easy when you are in industry and using Hortonworks. Scary that this means a lot of users are still affected by this bug. Can someone confirm that this bug affects .cache only (and not toPandas or collect)? Then at least I have a way around it... until someone installs a newer Spark. > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data as shown here: > {code} > import pyspark.sql.functions as F > num_rows=200 > for num_cols in range(198, 205): > # create data frame with id and some dummy cols > df1=spark.range(num_rows, numPartitions=100) > for i in range(num_cols-1): > df1=df1.withColumn("a"+str(i), F.lit("a")) > # create data frame with id to join > df2=spark.range(num_rows, numPartitions=100) > # write and read to start "fresh" > df1.write.parquet("delme_1.parquet", mode="overwrite") > df2.write.parquet("delme_2.parquet", mode="overwrite") > df1=spark.read.parquet("delme_1.parquet"); > df2=spark.read.parquet("delme_2.parquet"); > df3=df1.join(df2, "id", how="left").cache() # this cache seems to make > a difference > df4=df3.filter("id<10") > print(len(df4.columns), df4.count(), df4.cache().count()) # second > cache gives different result > {code} > Output: > {noformat} > 198 10 10 > 199 10 10 > 200 10 10 > 201 12 12 > 202 12 12 > 203 16 16 > 204 10 12 > {noformat} > Occasionally the middle number is also 10 (the expected result) more often. Last > column may show different values, but 12 and 16 are common. Sometimes you can > try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node > YARN cluster. > I am happy to provide more information if you let me know what is > interesting. > It's not strictly `cache` that is the problem, since `toPandas` and > `collect` show the same behavior, and I basically cannot get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns
[ https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144361#comment-16144361 ] Dongjoon Hyun commented on SPARK-21851: --- Hi, [~Antsu]. This is fixed in 2.0.1, too. Why don't you upgrade your system? > Spark 2.0 data corruption with cache and 200 columns > > > Key: SPARK-21851 > URL: https://issues.apache.org/jira/browse/SPARK-21851 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Anton Suchaneck > > Doing a join and cache can corrupt data as shown here: > {code} > import pyspark.sql.functions as F > num_rows=200 > for num_cols in range(198, 205): > # create data frame with id and some dummy cols > df1=spark.range(num_rows, numPartitions=100) > for i in range(num_cols-1): > df1=df1.withColumn("a"+str(i), F.lit("a")) > # create data frame with id to join > df2=spark.range(num_rows, numPartitions=100) > # write and read to start "fresh" > df1.write.parquet("delme_1.parquet", mode="overwrite") > df2.write.parquet("delme_2.parquet", mode="overwrite") > df1=spark.read.parquet("delme_1.parquet"); > df2=spark.read.parquet("delme_2.parquet"); > df3=df1.join(df2, "id", how="left").cache() # this cache seems to make > a difference > df4=df3.filter("id<10") > print(len(df4.columns), df4.count(), df4.cache().count()) # second > cache gives different result > {code} > Output: > {noformat} > 198 10 10 > 199 10 10 > 200 10 10 > 201 12 12 > 202 12 12 > 203 16 16 > 204 10 12 > {noformat} > Occasionally the middle number is also 10 (the expected result) more often. Last > column may show different values, but 12 and 16 are common. Sometimes you can > try slightly higher num_rows to get this behaviour. > Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node > YARN cluster. > I am happy to provide more information if you let me know what is > interesting. > It's not strictly `cache` that is the problem, since `toPandas` and > `collect` show the same behavior, and I basically cannot get at the data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21839: Assignee: (was: Apache Spark) > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21839: Assignee: Apache Spark > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
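Usage would presumably mirror the existing Parquet key; a sketch, assuming the proposed key lands with the same semantics (it is not a released option as of this thread):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-codec").getOrCreate()

// Existing Parquet analogue:
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

// Proposed ORC counterpart from this ticket (assumption: a per-write
// .option("compression", ...) would still take precedence, and the
// ORC data source is available in this build):
spark.conf.set("spark.sql.orc.compression.codec", "zlib")
spark.range(100).write.mode("overwrite").orc("/tmp/orc_out")
{code}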
[jira] [Assigned] (SPARK-20990) Multi-line support for JSON
[ https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20990: Assignee: (was: Apache Spark) > Multi-line support for JSON > --- > > Key: SPARK-20990 > URL: https://issues.apache.org/jira/browse/SPARK-20990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > When the `multiLine` option is on, the existing JSON parser only reads the first > record. We should read the other records in the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20990) Multi-line support for JSON
[ https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20990: Assignee: Apache Spark > Multi-line support for JSON > --- > > Key: SPARK-20990 > URL: https://issues.apache.org/jira/browse/SPARK-20990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > When the `multiLine` option is on, the existing JSON parser only reads the first > record. We should read the other records in the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
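For context, this is the read path in question; a minimal repro sketch, where the file path is hypothetical and assumed to hold several pretty-printed records:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-json").getOrCreate()

// /tmp/people.json is assumed to contain several multi-line JSON
// records in one file. With the bug described above, count() returns
// 1 instead of the number of records.
val df = spark.read.option("multiLine", "true").json("/tmp/people.json")
println(df.count())
{code}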
[jira] [Resolved] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17139. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 15435 [https://github.com/apache/spark/pull/15435] > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Weichen Xu > Fix For: 2.3.0 > > > Add model summary to multinomial logistic regression using the same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
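The "same interface" is the summary access pattern below, which previously threw for the multinomial family; the data path is the standard example dataset and, like the members shown, an assumption for illustration:
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mlr-summary").getOrCreate()

// Assumed multiclass training data in libsvm format.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val model = new LogisticRegression().setFamily("multinomial").fit(training)

// Same access pattern as the binomial summary; before this fix,
// model.summary raised an exception for multinomial models.
val summary = model.summary
println(summary.objectiveHistory.mkString(", "))
{code}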
[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21764: Assignee: (was: Apache Spark) > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but I decided to open another one here, targeting 2.3.0 as the fix version. > In short, there are many test failures on Windows, mainly due to resources > not being closed before removal is attempted (which fails on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21764: Assignee: Apache Spark > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but I decided to open another one here, targeting 2.3.0 as the fix version. > In short, there are many test failures on Windows, mainly due to resources > not being closed before removal is attempted (which fails on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
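The "resources not being closed before removal" failure comes down to Windows file locking; a minimal sketch of the usual fix pattern (not the actual patch):
{code}
import java.io.{File, FileOutputStream}

object CloseBeforeDelete extends App {
  // Windows refuses to delete a file that still has an open handle,
  // so test cleanup must close streams before removing temp paths.
  val f = File.createTempFile("spark-test", ".bin")
  val out = new FileOutputStream(f)
  try {
    out.write(42)
  } finally {
    out.close() // without this, f.delete() below fails on Windows
  }
  assert(f.delete())
}
{code}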
[jira] [Assigned] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-17139: - Assignee: Weichen Xu > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Weichen Xu > > Add model summary to multinomial logistic regression using the same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21829: Assignee: Apache Spark > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Assignee: Apache Spark >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it never expires. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21829: Assignee: (was: Apache Spark) > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it never expires. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
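As described, the node list is plain Spark configuration; spark.blacklist.alwaysBlacklistedNodes is the parameter proposed in this ticket's PR (not a merged option), while spark.blacklist.enabled already exists. Hostnames are illustrative:
{code}
import org.apache.spark.SparkConf

// Nodes listed in the proposed parameter would be added to the
// blacklist at SparkContext start and never expire.
val conf = new SparkConf()
  .setAppName("blacklist-example")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
{code}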
[jira] [Assigned] (SPARK-21513) SQL to_json should support all column types
[ https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21513: Assignee: Apache Spark > SQL to_json should support all column types > --- > > Key: SPARK-21513 > URL: https://issues.apache.org/jira/browse/SPARK-21513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aaron Davidson >Assignee: Apache Spark > Labels: Starter > > The built-in SQL UDF "to_json" currently supports serializing StructType > columns, as well as Arrays of StructType columns. If you attempt to use it on > a different type, for example a map, you get an error like this: > {code} > AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type > mismatch: Input type map must be a struct or array of > structs.;; > {code} > This limitation seems arbitrary; if I were to go through the effort of > enclosing my map in a struct, it would be serializable. Same thing with any > other non-struct type. > Therefore the desired improvement is to allow to_json to operate directly on > any column type. The associated code is > [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21513) SQL to_json should support all column types
[ https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21513: Assignee: (was: Apache Spark) > SQL to_json should support all column types > --- > > Key: SPARK-21513 > URL: https://issues.apache.org/jira/browse/SPARK-21513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aaron Davidson > Labels: Starter > > The built-in SQL UDF "to_json" currently supports serializing StructType > columns, as well as Arrays of StructType columns. If you attempt to use it on > a different type, for example a map, you get an error like this: > {code} > AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type > mismatch: Input type map must be a struct or array of > structs.;; > {code} > This limitation seems arbitrary; if I were to go through the effort of > enclosing my map in a struct, it would be serializable. Same thing with any > other non-struct type. > Therefore the desired improvement is to allow to_json to operate directly on > any column type. The associated code is > [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
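The struct-wrapping workaround mentioned in the description looks like this in practice (column names illustrative); calling to_json directly on the map column is what raises the error quoted above in 2.2:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, struct, to_json}

val spark = SparkSession.builder().appName("to-json-map").getOrCreate()
import spark.implicits._

val df = Seq(1).toDF("id").withColumn("tags", map(lit("k"), lit("v")))

// to_json($"tags") fails; wrapping the map in a struct first makes
// the column serializable to JSON.
df.select(to_json(struct($"tags")).as("json")).show(false)
{code}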
[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
[ https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21835: Assignee: Apache Spark > RewritePredicateSubquery should not produce unresolved query plans > -- > > Key: SPARK-21835 > URL: https://issues.apache.org/jira/browse/SPARK-21835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. > During the structural integrity check, I found {{RewritePredicateSubquery}} can > produce unresolved query plans due to conflicting attributes. We should not > let {{RewritePredicateSubquery}} produce unresolved plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
[ https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21835: Assignee: (was: Apache Spark) > RewritePredicateSubquery should not produce unresolved query plans > -- > > Key: SPARK-21835 > URL: https://issues.apache.org/jira/browse/SPARK-21835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > {{RewritePredicateSubquery}} rewrites correlated subqueries into join operations. > During the structural integrity check, I found {{RewritePredicateSubquery}} can > produce unresolved query plans due to conflicting attributes. We should not > let {{RewritePredicateSubquery}} produce unresolved plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
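For context, a hedged sketch (Scala; table names hypothetical) of the kind of predicate subquery that {{RewritePredicateSubquery}} rewrites into a LEFT SEMI join. Whether a given query actually triggers the conflicting-attribute problem depends on the plan, so this only illustrates the rewrite itself, not a guaranteed reproduction:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(10).createOrReplaceTempView("t1")
spark.range(5).createOrReplaceTempView("t2")

// The optimizer rewrites the IN predicate below into a LEFT SEMI join;
// explain(true) shows the rewritten plan. When both join sides end up
// exposing attributes with the same expression ids, the rewritten plan
// can become unresolved -- the situation this issue guards against.
spark.sql("SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)").explain(true)
{code}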
[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21834: Assignee: Apache Spark > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > The killExecutor API currently does not allow killing an executor without > updating the total number of executors needed. When dynamic allocation > is turned on and the allocator tries to kill an executor, the scheduler > reduces the total number of executors needed (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), > which is incorrect because the allocator already takes care of setting the > required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21834: Assignee: (was: Apache Spark) > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > The killExecutor API currently does not allow killing an executor without > updating the total number of executors needed. When dynamic allocation > is turned on and the allocator tries to kill an executor, the scheduler > reduces the total number of executors needed (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), > which is incorrect because the allocator already takes care of setting the > required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
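To make the bookkeeping problem in SPARK-21834 concrete, here is a self-contained toy model (all names hypothetical; this is not Spark's actual API). The idea is that a kill initiated by the dynamic-allocation manager should be able to skip the target adjustment, since the manager maintains the target itself:
{code}
// Toy model of the executor-count bookkeeping; not Spark code.
class SchedulerBackendModel(var targetNumExecutors: Int) {
  // Today killExecutor always shrinks the target; the hypothetical
  // `adjustTarget` flag lets a caller that manages the target itself
  // (the dynamic-allocation manager) opt out of the decrement.
  def killExecutor(adjustTarget: Boolean): Unit = {
    if (adjustTarget) targetNumExecutors -= 1
    // ...then ask the cluster manager to actually kill the executor...
  }
}

object Demo extends App {
  val backend = new SchedulerBackendModel(targetNumExecutors = 10)
  backend.targetNumExecutors = 9       // allocator already lowered the target
  backend.killExecutor(adjustTarget = false)
  assert(backend.targetNumExecutors == 9)  // no double decrement
}
{code}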
[jira] [Assigned] (SPARK-19662) Add Fair Scheduler Unit Test coverage for different build cases
[ https://issues.apache.org/jira/browse/SPARK-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-19662: Assignee: Eren Avsarogullari > Add Fair Scheduler Unit Test coverage for different build cases > --- > > Key: SPARK-19662 > URL: https://issues.apache.org/jira/browse/SPARK-19662 > Project: Spark > Issue Type: Test > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Minor > Fix For: 2.3.0 > > > The Fair Scheduler can be built via one of the following options: > - By setting the {{spark.scheduler.allocation.file}} property > - By placing {{fairscheduler.xml}} on the classpath > These options are checked in order, and the fair scheduler is built from the > first one found. If an invalid path is given, a {{FileNotFoundException}} is > thrown. > The related PR adds unit test coverage for these use cases, and a minor > documentation change has been added for the second option ({{fairscheduler.xml}} > on the classpath) to inform the user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
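For reference, a minimal sketch (Scala; the file path is a placeholder) of the two configuration options the SPARK-19662 tests cover:
{code}
import org.apache.spark.sql.SparkSession

// Option 1: point Spark at an explicit allocation file.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Option 2: omit spark.scheduler.allocation.file and place a file named
// fairscheduler.xml on the application classpath; Spark falls back to it
// when the property is unset. Per the issue, an invalid explicit path
// results in a FileNotFoundException.
{code}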