[jira] [Assigned] (SPARK-21690) one-pass imputer

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21690:


Assignee: Apache Spark  (was: zhengruifeng)

> one-pass imputer
> 
>
> Key: SPARK-21690
> URL: https://issues.apache.org/jira/browse/SPARK-21690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
>     .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if (filtered.take(1).length == 0) {
>     throw new SparkException(s"surrogate cannot be computed. " +
>       s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
>     case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
>     case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current impl of {{Imputer}} processes one column after another. Instead, we
> should parallelize the processing in a more efficient way.
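
One possible direction (a rough sketch, not an actual patch) is to compute all the surrogates in a single pass: one multi-column aggregation for the mean strategy, and one multi-column {{approxQuantile}} call for the median strategy. The snippet below assumes the same class context and params ($(inputCols), $(missingValue), $(strategy)) as the code above:

{code}
// Rough sketch of a one-pass alternative (not the actual fix).
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// Null out missing/NaN entries for every input column in a single projection.
val cleaned = dataset.select($(inputCols).map { c =>
  val ic = col(c).cast(DoubleType)
  when(!ic.isNaN && ic =!= $(missingValue), ic).as(c)
}: _*)

val surrogates: Array[Double] = $(strategy) match {
  case Imputer.mean =>
    // One aggregation job computes the mean of every column at once.
    val row = cleaned.select($(inputCols).map(c => avg(col(c))): _*).head()
    $(inputCols).indices.map(i => if (row.isNullAt(i)) Double.NaN else row.getDouble(i)).toArray
  case Imputer.median =>
    // approxQuantile accepts multiple columns in one call (Spark 2.2+).
    cleaned.stat.approxQuantile($(inputCols), Array(0.5), 0.001).map(_.head)
}
{code}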



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21690) one-pass imputer

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21690:


Assignee: zhengruifeng  (was: Apache Spark)

> one-pass imputer
> 
>
> Key: SPARK-21690
> URL: https://issues.apache.org/jira/browse/SPARK-21690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
>     .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if (filtered.take(1).length == 0) {
>     throw new SparkException(s"surrogate cannot be computed. " +
>       s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
>     case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
>     case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current impl of {{Imputer}} processes one column after another. Instead, we
> should parallelize the processing in a more efficient way.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id

2017-08-28 Thread Yann Byron (JIRA)
Yann Byron created SPARK-21858:
--

 Summary: Make Spark grouping_id() compatible with Hive grouping__id
 Key: SPARK-21858
 URL: https://issues.apache.org/jira/browse/SPARK-21858
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Yann Byron


If you want to migrate ETLs that use `grouping__id` in Hive to Spark, using Spark's 
`grouping_id()` instead of Hive's `grouping__id`, you will find differences between 
their evaluations.

Here is an example.
{code:java}
select A, B, grouping__id/grouping_id() from t group by A, B grouping sets((), (A), (B), (A,B))
{code}

Running it on Hive and on Spark separately, you'll see the following (an attribute 
selected in a grouping set is represented by (/), an unselected one by (x)):
||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B A||
|(x) (x)|11|3|0|00|(x) (x)|
|(x) (/)|10|2|2|10|(/) (x)|
|(/) (x)|01|1|1|01|(x) (/)|
|(/) (/)|00|0|3|11|(/) (/)|

As shown above, in Hive (/) is set to 0 and (x) to 1, while in Spark it's the opposite.
Moreover, the attributes in `group by` are reversed first in Hive, whereas Spark 
evaluates them in their original order.

I suggest modifying the behavior of `grouping_id()` to make it compatible with 
Hive's `grouping__id`.
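
To make the mapping concrete, here is a hypothetical helper derived purely from the table above (illustrative only, not a proposed API): flip each bit of the value returned by Spark's {{grouping_id()}} and reverse the bit order to obtain the Hive-style value.

{code}
// Hypothetical converter, derived from the table above (n = number of GROUP BY columns).
def sparkToHiveGroupingId(sparkId: Long, n: Int): Long = {
  val flipped = ~sparkId & ((1L << n) - 1)        // flip every bit
  (0 until n).foldLeft(0L) { (acc, i) =>          // reverse the bit order (A B -> B A)
    (acc << 1) | ((flipped >> i) & 1L)
  }
}

sparkToHiveGroupingId(3L, 2)  // Spark 3 for (x)(x) -> Hive 0
sparkToHiveGroupingId(2L, 2)  // Spark 2 for (x)(/) -> Hive 2
sparkToHiveGroupingId(0L, 2)  // Spark 0 for (/)(/) -> Hive 3
{code}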




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21857) Exception in thread "main" java.lang.ExceptionInInitializerError

2017-08-28 Thread Nagamanoj (JIRA)
Nagamanoj created SPARK-21857:
-

 Summary: Exception in thread "main" 
java.lang.ExceptionInInitializerError
 Key: SPARK-21857
 URL: https://issues.apache.org/jira/browse/SPARK-21857
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Nagamanoj


After installing Spark using a prebuilt version, when we run ./bin/pyspark
(Java version = Java 9), I'm getting the following exception:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/28 20:06:43 INFO SparkContext: Running Spark version 2.2.0
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
at org.apache.spark.SparkContext.(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
at java.base/java.lang.String.substring(String.java:1885)
at org.apache.hadoop.util.Shell.(Shell.java:52
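
The {{StringIndexOutOfBoundsException: begin 0, end 3, length 1}} is consistent with code that takes a fixed 3-character prefix of the {{java.version}} system property, which on Java 9 is simply "9". A minimal sketch of that failure pattern (hypothetical, mirroring what Hadoop's Shell initializer appears to do, not its actual source):

{code}
// Hypothetical reproduction of the failing pattern on Java 9.
val javaVersion = "9"                     // System.getProperty("java.version") on Java 9
val prefix = javaVersion.substring(0, 3)  // StringIndexOutOfBoundsException: begin 0, end 3, length 1
{code}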




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21855:
-
Component/s: (was: Deploy)
 YARN

> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Priority: Trivial
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}
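
A toy sketch of the proposal (hypothetical helpers, not the actual yarn.Client code): report a duplicate resource at error level instead of warn; `log` stands in for Spark's Logging trait.

{code}
import scala.collection.mutable

val log = org.slf4j.LoggerFactory.getLogger("yarn.Client")
val distributed = mutable.Set.empty[String]

def addResource(uri: String): Unit = {
  if (!distributed.add(uri)) {
    log.error(s"Resource $uri added multiple times to distributed cache.")  // was: log.warn(...)
  }
}
{code}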



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-08-28 Thread Ming Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144719#comment-16144719
 ] 

Ming Jiang commented on SPARK-21856:


I can work on it, thanks!

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-08-28 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-21856:
--

 Summary: Update Python API for MultilayerPerceptronClassifierModel
 Key: SPARK-21856
 URL: https://issues.apache.org/jira/browse/SPARK-21856
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Weichen Xu


SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21855:


Assignee: Apache Spark

> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21855:


Assignee: (was: Apache Spark)

> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Priority: Trivial
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144710#comment-16144710
 ] 

Apache Spark commented on SPARK-21855:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19073

> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Priority: Trivial
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21855:
-
Description: 
Currently, when a job is submitted to YARN and the same file is uploaded multiple 
times, an exception is thrown but the logging level is only warn.

{code:java}
17/08/29 11:17:37 WARN yarn.Client: Resource 
hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
multiple times to distributed cache.
{code}


> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Priority: Minor
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21855:
-
Priority: Trivial  (was: Minor)

> When submit job to yarn and add file multiple times,we should log error 
> instead of warning
> --
>
> Key: SPARK-21855
> URL: https://issues.apache.org/jira/browse/SPARK-21855
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: zhoukang
>Priority: Trivial
>
> Currently, when a job is submitted to YARN and the same file is uploaded multiple 
> times, an exception is thrown but the logging level is only warn.
> {code:java}
> 17/08/29 11:17:37 WARN yarn.Client: Resource 
> hdfs://tjwqtst-galaxy/spark/tjwqtst-transfer/scripts/oom/oom_script.sh added 
> multiple times to distributed cache.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21855) When submit job to yarn and add file multiple times,we should log error instead of warning

2017-08-28 Thread zhoukang (JIRA)
zhoukang created SPARK-21855:


 Summary: When submit job to yarn and add file multiple times,we 
should log error instead of warning
 Key: SPARK-21855
 URL: https://issues.apache.org/jira/browse/SPARK-21855
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.2.0
Reporter: zhoukang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe

2017-08-28 Thread Shailesh Kini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shailesh Kini updated SPARK-21853:
--
Issue Type: Bug  (was: Question)

> Getting an exception while calling the except method on the dataframe
> -
>
> Key: SPARK-21853
> URL: https://issues.apache.org/jira/browse/SPARK-21853
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Shailesh Kini
> Attachments: SparkException.txt
>
>
> I am getting an exception while calling except on the Dataset.
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> SVC_BILLING_PERIOD#37723 missing from
> I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to 
> create DS3. DS3 has some rows which are similar with the exception of one 
> column. I need to isolate those rows and remove the similar rows. I use 
> groupBy with the count > 1 on a few columns in DS3 to get those similar rows 
> - dataset DS4. DS4 has only a few columns and not all so I join it back with 
> DS3 on the aggregate columns to get a new dataset DS5 which has the same 
> columns as DS3. To get a clean dataset without any of those similar rows, I 
> am calling DS3.except(DS5), which throws the exception. The attribute is one 
> of the filtering criteria I use when creating DS1.
> Attaching the exception to this ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21853) Getting an exception while calling the except method on the dataframe

2017-08-28 Thread Shailesh Kini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144697#comment-16144697
 ] 

Shailesh Kini commented on SPARK-21853:
---

As a workaround, I saved the dataset DS3 in Parquet format and read it back, 
after which I was able to successfully call except.
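
In code, the workaround looks roughly like this (a sketch using the DS3/DS5 names from the description and a hypothetical path; not taken from the ticket):

{code}
// Materialize DS3 to Parquet and read it back so the analyzer sees fresh attribute ids.
DS3.write.mode("overwrite").parquet("/tmp/ds3_checkpoint")
val ds3Fresh = spark.read.parquet("/tmp/ds3_checkpoint")
val cleaned = ds3Fresh.except(DS5)   // previously failed with "resolved attribute(s) ... missing"
{code}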

> Getting an exception while calling the except method on the dataframe
> -
>
> Key: SPARK-21853
> URL: https://issues.apache.org/jira/browse/SPARK-21853
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Shailesh Kini
> Attachments: SparkException.txt
>
>
> I am getting an exception while calling except on the Dataset.
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> SVC_BILLING_PERIOD#37723 missing from
> I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to 
> create DS3. DS3 has some rows which are similar with the exception of one 
> column. I need to isolate those rows and remove the similar rows. I use 
> groupBy with the count > 1 on a few columns in DS3 to get those similar rows 
> - dataset DS4. DS4 has only a few columns and not all so I join it back with 
> DS3 on the aggregate columns to get a new dataset DS5 which has the same 
> columns as DS3. To get a clean dataset without any of those similar rows, I 
> am calling DS3.except(DS5), which throws the exception. The attribute is one 
> of the filtering criteria I use when creating DS1.
> Attaching the exception to this ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17133) Improvements to linear methods in Spark

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17133:


Assignee: Apache Spark

> Improvements to linear methods in Spark
> ---
>
> Key: SPARK-17133
> URL: https://issues.apache.org/jira/browse/SPARK-17133
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> This JIRA is for tracking several improvements that we should make to 
> Linear/Logistic regression in Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17133) Improvements to linear methods in Spark

2017-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144695#comment-16144695
 ] 

Apache Spark commented on SPARK-17133:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19072

> Improvements to linear methods in Spark
> ---
>
> Key: SPARK-17133
> URL: https://issues.apache.org/jira/browse/SPARK-17133
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> This JIRA is for tracking several improvements that we should make to 
> Linear/Logistic regression in Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17133) Improvements to linear methods in Spark

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17133:


Assignee: (was: Apache Spark)

> Improvements to linear methods in Spark
> ---
>
> Key: SPARK-17133
> URL: https://issues.apache.org/jira/browse/SPARK-17133
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> This JIRA is for tracking several improvements that we should make to 
> Linear/Logistic regression in Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: Apache Spark

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.
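
A minimal usage sketch, assuming the proposed config name is adopted and mirrors the Parquet one (codec values shown are illustrative):

{code}
spark.conf.set("spark.sql.orc.compression.codec", "zlib")   // e.g. none, snappy, zlib, lzo
spark.range(10).write.mode("overwrite").orc("/tmp/orc_output")
{code}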



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: (was: Apache Spark)

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe

2017-08-28 Thread Shailesh Kini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shailesh Kini updated SPARK-21853:
--
Description: 
I am getting an exception while calling except on the Dataset.

org.apache.spark.sql.AnalysisException: resolved attribute(s) 
SVC_BILLING_PERIOD#37723 missing from

I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to 
create DS3. DS3 has some rows which are similar with the exception of one 
column. I need to isolate those rows and remove the similar rows. I use groupBy 
with the count > 1 on a few columns in DS3 to get those similar rows - dataset 
DS4. DS4 has only a few columns and not all so I join it back with DS3 on the 
aggregate columns to get a new dataset DS5 which has the same columns as DS3. 
To get a clean dataset without any of those similar rows, I am calling 
DS3.except(DS5), which throws the exception. The attribute is one of the 
filtering criteria I use when creating DS1.

Attaching the exception to this ticket.

  was:
I am getting an exception while calling except on the Dataset.

org.apache.spark.sql.AnalysisException: resolved attribute(s) 
SVC_BILLING_PERIOD#37723 missing from

I have 2 CSV files. I create two datasets, DS1 and DS2, which I join to create DS3. I 
need to filter out duplicates for further processing. I aggregate the DS3 dataset on 
some columns and filter where the count > 1; this is DS4. I now join DS3 with DS4 on 
those columns and get DS5. DS5 has the same structure as DS3, as I drop the columns 
from the join. DS5 now has all the rows which are duplicates. I then call except on 
DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to filter out 
and remove one of each pair of duplicates (not all the columns are duplicates, so I 
need to use filter) and union it with DS6 to get the dataset free of duplicates.

Attaching the exception to this ticket.


> Getting an exception while calling the except method on the dataframe
> -
>
> Key: SPARK-21853
> URL: https://issues.apache.org/jira/browse/SPARK-21853
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Shailesh Kini
> Attachments: SparkException.txt
>
>
> I am getting an exception while calling except on the Dataset.
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> SVC_BILLING_PERIOD#37723 missing from
> I read 2 csv files into datasets DS1 and DS2, which I join (full outer) to 
> create DS3. DS3 has some rows which are similar with the exception of one 
> column. I need to isolate those rows and remove the similar rows. I use 
> groupBy with the count > 1 on a few columns in DS3 to get those similar rows 
> - dataset DS4. DS4 has only a few columns and not all so I join it back with 
> DS3 on the aggregate columns to get a new dataset DS5 which has the same 
> columns as DS3. To get a clean dataset without any of those similar rows, I 
> am calling DS3.except(DS5), which throws the exception. The attribute is one 
> of the filtering criteria I use when creating DS1.
> Attaching the exception to this ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21854) Python interface for MLOR summary

2017-08-28 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-21854:
--

 Summary: Python interface for MLOR summary
 Key: SPARK-21854
 URL: https://issues.apache.org/jira/browse/SPARK-21854
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Weichen Xu


Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21834:


Assignee: Apache Spark

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> The killExecutor API currently does not allow killing an executor without 
> updating the total number of executors needed. When dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635),
> which is incorrect because the allocator already takes care of setting the 
> required number of executors itself.
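
A toy illustration of the problem shape (hypothetical class and fields, not the real CoarseGrainedSchedulerBackend): if killing an executor always lowers the requested total, it silently overrides the target that the dynamic-allocation manager maintains itself.

{code}
// Toy model only; one possible direction is to let the caller decide whether
// the target should be adjusted.
class ToyBackend(var requestedTotal: Int, var executors: Set[String]) {
  def killExecutor(id: String, adjustTargetNumExecutors: Boolean): Unit = {
    executors -= id
    if (adjustTargetNumExecutors) requestedTotal -= 1
  }
}

val backend = new ToyBackend(10, (1 to 10).map("exec-" + _).toSet)
// Dynamic allocation kills an idle executor but keeps managing the total itself,
// so the requested total should stay untouched here.
backend.killExecutor("exec-3", adjustTargetNumExecutors = false)
{code}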



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21834:


Assignee: (was: Apache Spark)

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> The killExecutor API currently does not allow killing an executor without 
> updating the total number of executors needed. When dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635),
> which is incorrect because the allocator already takes care of setting the 
> required number of executors itself.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21801) SparkR unit test randomly fail on trees

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21801:


Assignee: (was: Apache Spark)

> SparkR unit test randomly fail on trees
> ---
>
> Key: SPARK-21801
> URL: https://issues.apache.org/jira/browse/SPARK-21801
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Critical
>
> SparkR unit tests sometimes randomly fail with errors such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_d6a0b492cfa1 should have at least two distinct values.
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21801) SparkR unit test randomly fail on trees

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21801:


Assignee: Apache Spark

> SparkR unit test randomly fail on trees
> ---
>
> Key: SPARK-21801
> URL: https://issues.apache.org/jira/browse/SPARK-21801
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Critical
>
> SparkR unit tests sometimes randomly fail with errors such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_d6a0b492cfa1 should have at least two distinct values.
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21852) Empty Parquet Files created as a result of spark jobs fail when read

2017-08-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144650#comment-16144650
 ] 

Hyukjin Kwon commented on SPARK-21852:
--

I generally agree with Sean and am quite sure this is not an issue. However, I 
want to make sure before resolving this (as I have seen at least a few corner 
cases so far).

BTW, I'd close the Parquet JIRA you opened; this does not look like a Parquet issue. 
I would resolve this one if no more details can be provided.

> Empty Parquet Files created as a result of spark jobs fail when read
> 
>
> Key: SPARK-21852
> URL: https://issues.apache.org/jira/browse/SPARK-21852
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Shivam Dalmia
>
> I have intermittently faced an issue with certain Spark jobs writing Parquet 
> files: they apparently succeed, but the written .parquet directory in HDFS is 
> an empty directory (with no _SUCCESS and _metadata parts, even). 
> Surprisingly, no errors are thrown by the Spark DataFrame writer.
> However, when attempting to read this written output, Spark throws the error:
> {{Unable to infer schema for Parquet. It must be specified manually}}
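
The error message itself points at a possible stop-gap for reading such a directory (a sketch with a hypothetical schema and path; it does not address why the write produced no files):

{code}
import org.apache.spark.sql.types._

// Supplying the schema explicitly avoids "Unable to infer schema for Parquet" and
// should simply yield an empty DataFrame when the directory contains no part files.
val schema = StructType(Seq(StructField("id", LongType), StructField("value", StringType)))
val df = spark.read.schema(schema).parquet("/path/to/maybe_empty_dir")
{code}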



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21853) Getting an exception while calling the except method on the dataframe

2017-08-28 Thread Shailesh Kini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shailesh Kini updated SPARK-21853:
--
Attachment: SparkException.txt

> Getting an exception while calling the except method on the dataframe
> -
>
> Key: SPARK-21853
> URL: https://issues.apache.org/jira/browse/SPARK-21853
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Shailesh Kini
> Attachments: SparkException.txt
>
>
> I am getting an exception while calling except on the Dataset.
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> SVC_BILLING_PERIOD#37723 missing from
> I have 2 CSV files. I create two datasets, DS1 and DS2, which I join to create DS3. 
> I need to filter out duplicates for further processing. I aggregate the DS3 dataset 
> on some columns and filter where the count > 1; this is DS4. I now join DS3 with DS4 
> on those columns and get DS5. DS5 has the same structure as DS3, as I drop the 
> columns from the join. DS5 now has all the rows which are duplicates. I then call 
> except on DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to 
> filter out and remove one of each pair of duplicates (not all the columns are 
> duplicates, so I need to use filter) and union it with DS6 to get the dataset free 
> of duplicates.
> Attaching the exception to this ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21853) Getting an exception while calling the except method on the dataframe

2017-08-28 Thread Shailesh Kini (JIRA)
Shailesh Kini created SPARK-21853:
-

 Summary: Getting an exception while calling the except method on 
the dataframe
 Key: SPARK-21853
 URL: https://issues.apache.org/jira/browse/SPARK-21853
 Project: Spark
  Issue Type: Question
  Components: Spark Shell
Affects Versions: 2.1.1
Reporter: Shailesh Kini


I am getting an exception while calling except on the Dataset.

org.apache.spark.sql.AnalysisException: resolved attribute(s) 
SVC_BILLING_PERIOD#37723 missing from

I have 2 CSV files. I create two datasets, DS1 and DS2, which I join to create DS3. I 
need to filter out duplicates for further processing. I aggregate the DS3 dataset on 
some columns and filter where the count > 1; this is DS4. I now join DS3 with DS4 on 
those columns and get DS5. DS5 has the same structure as DS3, as I drop the columns 
from the join. DS5 now has all the rows which are duplicates. I then call except on 
DS3 to get a dataset DS6 with all the rows not in DS5. I am planning to filter out 
and remove one of each pair of duplicates (not all the columns are duplicates, so I 
need to use filter) and union it with DS6 to get the dataset free of duplicates.

Attaching the exception to this ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2017-08-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15689:

Labels: SPIP releasenotes  (was: releasenotes)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making the data source API's compatibility depend on 
> the upper-level API. The current data source API is also only row-oriented 
> and has to go through an expensive conversion from external data types to 
> internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21835:


Assignee: Apache Spark

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. 
> During the structural integrity check, I found {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21835:


Assignee: (was: Apache Spark)

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. 
> During the structural integrity check, I found {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21568) ConsoleProgressBar should only be enabled in shells

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21568:


Assignee: (was: Apache Spark)

> ConsoleProgressBar should only be enabled in shells
> ---
>
> Key: SPARK-21568
> URL: https://issues.apache.org/jira/browse/SPARK-21568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is the current logic that enables the progress bar:
> {code}
> _progressBar =
>   if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
>     Some(new ConsoleProgressBar(this))
>   } else {
>     None
>   }
> {code}
> That is based on the logging level; it just happens to align with the default 
> configuration for shells (WARN) and normal apps (INFO).
> But if someone changes the default logging config for their app, this may 
> break; they may silence logs by setting the default level to WARN or ERROR, 
> and a normal application will see a lot of log spam from the progress bar 
> (which is especially bad when output is redirected to a file, as is usually 
> done when running in cluster mode).
> While it's possible to disable the progress bar separately, this behavior is 
> not really expected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21568) ConsoleProgressBar should only be enabled in shells

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21568:


Assignee: Apache Spark

> ConsoleProgressBar should only be enabled in shells
> ---
>
> Key: SPARK-21568
> URL: https://issues.apache.org/jira/browse/SPARK-21568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> This is the current logic that enables the progress bar:
> {code}
> _progressBar =
>   if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
>     Some(new ConsoleProgressBar(this))
>   } else {
>     None
>   }
> {code}
> That is based on the logging level; it just happens to align with the default 
> configuration for shells (WARN) and normal apps (INFO).
> But if someone changes the default logging config for their app, this may 
> break; they may silence logs by setting the default level to WARN or ERROR, 
> and a normal application will see a lot of log spam from the progress bar 
> (which is especially bad when output is redirected to a file, as is usually 
> done when running in cluster mode).
> While it's possible to disable the progress bar separately, this behavior is 
> not really expected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21841) Spark SQL doesn't pick up column added in hive when table created with saveAsTable

2017-08-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144540#comment-16144540
 ] 

Marcelo Vanzin commented on SPARK-21841:


"DataSource tables" (those created, in certain cases, with {{saveAsTable}}), 
have pretty spotty Hive compatibility. I've run into this in a recent PR 
(SPARK-21617) and [~smilegator] suggested having an explicit config added to 
ensure compatibility, although I don't think anyone is working on that.

The workaround you have (using DDL SQL commands instead of doing it via Scala 
code) is what we have been suggesting to people for a really long time now.

I haven't looked closely at the spec to see whether it covers this, but maybe 
this could be called out explicitly in SPARK-15689, which plans to update the 
DataSource APIs.


> Spark SQL doesn't pick up column added in hive when table created with 
> saveAsTable
> --
>
> Key: SPARK-21841
> URL: https://issues.apache.org/jira/browse/SPARK-21841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Thomas Graves
>
> If you create a table in Spark SQL but then modify the table in Hive to 
> add a column, Spark SQL doesn't pick up the new column.
> Basic example:
> {code}
> t1 = spark.sql("select ip_address from mydb.test_table limit 1")
> t1.show()
> ++
> |  ip_address|
> ++
> |1.30.25.5|
> ++
> t1.write.saveAsTable('mydb.t1')
> In Hive:
> alter table mydb.t1 add columns (bcookie string)
> t1 = spark.table("mydb.t1")
> t1.show()
> ++
> |  ip_address|
> ++
> |1.30.25.5|
> ++
> {code}
> It looks like it's because Spark SQL is picking up the schema from 
> spark.sql.sources.schema.part.0 rather than from Hive.
> Interestingly enough, it appears that if you create the table differently, like:
> spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
> Run your alter table on mydb.t1
> val t1 = spark.table("mydb.t1")  
> Then it works properly.
> It looks like the difference is that, when it doesn't work, 
> spark.sql.sources.provider=parquet is set.
> It's doing this from createDataSourceTable where the provider is parquet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21729:


Assignee: (was: Apache Spark)

> Generic test for ProbabilisticClassifier to ensure consistent output columns
> 
>
> Key: SPARK-21729
> URL: https://issues.apache.org/jira/browse/SPARK-21729
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> One challenge with the ProbabilisticClassifier abstraction is that it 
> introduces different code paths for predictions depending on which output 
> columns are turned on or off: probability, rawPrediction, prediction.  We ran 
> into a bug in MLOR with this.
> This task is for adding a generic test usable in all test suites for 
> ProbabilisticClassifier types which does the following:
> * Take a dataset + Estimator
> * Fit the Estimator
> * Test prediction using the model with all combinations of output columns 
> turned on/off.
> * Make sure the output column values match, presumably by comparing vs. the 
> case with all 3 output columns turned on
> CC [~WeichenXu123] since this came up in 
> https://github.com/apache/spark/pull/17373



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21729:


Assignee: Apache Spark

> Generic test for ProbabilisticClassifier to ensure consistent output columns
> 
>
> Key: SPARK-21729
> URL: https://issues.apache.org/jira/browse/SPARK-21729
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> One challenge with the ProbabilisticClassifier abstraction is that it 
> introduces different code paths for predictions depending on which output 
> columns are turned on or off: probability, rawPrediction, prediction.  We ran 
> into a bug in MLOR with this.
> This task is for adding a generic test usable in all test suites for 
> ProbabilisticClassifier types which does the following:
> * Take a dataset + Estimator
> * Fit the Estimator
> * Test prediction using the model with all combinations of output columns 
> turned on/off.
> * Make sure the output column values match, presumably by comparing vs. the 
> case with all 3 output columns turned on
> CC [~WeichenXu123] since this came up in 
> https://github.com/apache/spark/pull/17373



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: (was: Apache Spark)

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When the `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.
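
A minimal usage sketch of the affected path (hypothetical file path), where one file holds several pretty-printed JSON records back to back:

{code}
val df = spark.read.option("multiLine", true).json("/tmp/multi_record.json")
// With the bug described above this returns 1 even if the file contains more records;
// after the fix it should return the real count.
df.count()
{code}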



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: Apache Spark

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When the `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21851.
---
Resolution: Duplicate

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
>     # create data frame with id and some dummy cols
>     df1=spark.range(num_rows, numPartitions=100)
>     for i in range(num_cols-1):
>         df1=df1.withColumn("a"+str(i), F.lit("a"))
>     # create data frame with id to join
>     df2=spark.range(num_rows, numPartitions=100)
>     # write and read to start "fresh"
>     df1.write.parquet("delme_1.parquet", mode="overwrite")
>     df2.write.parquet("delme_2.parquet", mode="overwrite")
>     df1=spark.read.parquet("delme_1.parquet")
>     df2=spark.read.parquet("delme_2.parquet")
>     df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
>     df4=df3.filter("id<10")
>     print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result) more often. The last 
> column may show different values, but 12 and 16 are common. Sometimes you may have 
> to try a slightly higher num_rows to get this behaviour.
> Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multiple node 
> YARN cluster.
> I am happy to provide more information, if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem, since `toPandas` and 
> `collect` show the same behavior, and I basically can hardly get at the data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: Apache Spark

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions
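> A minimal sketch of the idea (the names below are illustrative, not the final 
> Spark API): a marker trait that UDF expressions mix in, so analysis rules can 
> match on it uniformly.
> {code}
> // simplified stand-ins for Catalyst expression classes
> trait UserDefinedExpression
> case class ScalaUDF(name: String) extends UserDefinedExpression
> case class ScalaUDAF(name: String) extends UserDefinedExpression
>
> def isUserDefined(e: Any): Boolean = e.isInstanceOf[UserDefinedExpression]
> {code}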



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: (was: Apache Spark)

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144403#comment-16144403
 ] 

Dongjoon Hyun commented on SPARK-21851:
---

For the 1.6.2 issue, I think all vendors have already delivered this hotfix to 
their customers. Please ask your support team. :)

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
> # create data frame with id and some dummy cols
> df1=spark.range(num_rows, numPartitions=100)
> for i in range(num_cols-1):
> df1=df1.withColumn("a"+str(i), F.lit("a"))
> # create data frame with id to join
> df2=spark.range(num_rows, numPartitions=100)
> # write and read to start "fresh"
> df1.write.parquet("delme_1.parquet", mode="overwrite")
> df2.write.parquet("delme_2.parquet", mode="overwrite")
> df1=spark.read.parquet("delme_1.parquet");
> df2=spark.read.parquet("delme_2.parquet");
> df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
> df4=df3.filter("id<10")
> print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result); it varies between 
> runs. The last column may show different values, but 12 and 16 are common. Sometimes 
> a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node 
> YARN cluster.
> I am happy to provide more information if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem: `toPandas` and `collect` show 
> the same behaviour, so I basically can hardly get at the data at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Anton Suchaneck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144394#comment-16144394
 ] 

Anton Suchaneck commented on SPARK-21851:
-

Not quite production, but still used for relevant work. Thanks for pointing it out. And 
I sure learned a lesson to watch the JIRAs of x.0.0 versions ;) Actually, 
judging by Hortonworks 2.5 and the fact that Spark 1.6.2 is also affected, you are 
screwed either way, even if you use the old Spark :-o

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
> # create data frame with id and some dummy cols
> df1=spark.range(num_rows, numPartitions=100)
> for i in range(num_cols-1):
> df1=df1.withColumn("a"+str(i), F.lit("a"))
> # create data frame with id to join
> df2=spark.range(num_rows, numPartitions=100)
> # write and read to start "fresh"
> df1.write.parquet("delme_1.parquet", mode="overwrite")
> df2.write.parquet("delme_2.parquet", mode="overwrite")
> df1=spark.read.parquet("delme_1.parquet");
> df2=spark.read.parquet("delme_2.parquet");
> df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
> df4=df3.filter("id<10")
> print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result); it varies between 
> runs. The last column may show different values, but 12 and 16 are common. Sometimes 
> a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node 
> YARN cluster.
> I am happy to provide more information if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem: `toPandas` and `collect` show 
> the same behaviour, so I basically can hardly get at the data at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144381#comment-16144381
 ] 

Dongjoon Hyun commented on SPARK-21851:
---

Unfortunately, there is no software without bugs. BTW, if your cluster is 
using Hortonworks, you know that Spark 2.0.0 is a technical preview due to that 
kind of issue. You are not using it in production, are you?

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
> # create data frame with id and some dummy cols
> df1=spark.range(num_rows, numPartitions=100)
> for i in range(num_cols-1):
> df1=df1.withColumn("a"+str(i), F.lit("a"))
> # create data frame with id to join
> df2=spark.range(num_rows, numPartitions=100)
> # write and read to start "fresh"
> df1.write.parquet("delme_1.parquet", mode="overwrite")
> df2.write.parquet("delme_2.parquet", mode="overwrite")
> df1=spark.read.parquet("delme_1.parquet");
> df2=spark.read.parquet("delme_2.parquet");
> df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
> df4=df3.filter("id<10")
> print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result); it varies between 
> runs. The last column may show different values, but 12 and 16 are common. Sometimes 
> a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node 
> YARN cluster.
> I am happy to provide more information if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem: `toPandas` and `collect` show 
> the same behaviour, so I basically can hardly get at the data at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: (was: Apache Spark)

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.
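> A toy model of the intended invariant (not the actual Catalyst classes): every 
> leaf derived from a streaming source reports isStreaming = true, and non-leaf 
> nodes derive the bit from their children so it cannot be lost in rewriting.
> {code}
> sealed trait Plan { def isStreaming: Boolean }
> case class BatchLeaf(name: String) extends Plan { val isStreaming = false }
> case class StreamLeaf(name: String) extends Plan { val isStreaming = true }
> case class Node(children: Seq[Plan]) extends Plan {
>   val isStreaming: Boolean = children.exists(_.isStreaming)
> }
> {code}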



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: Apache Spark

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Anton Suchaneck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144376#comment-16144376
 ] 

Anton Suchaneck commented on SPARK-21851:
-

I wish upgrading were that easy when you are in industry and using Hortonworks. 
It is scary that this means a lot of users are still affected by this bug. Can 
someone confirm whether this bug affects .cache only (and not toPandas or 
collect)? Then at least I would have a way around it... until someone installs a 
newer Spark.

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
> # create data frame with id and some dummy cols
> df1=spark.range(num_rows, numPartitions=100)
> for i in range(num_cols-1):
> df1=df1.withColumn("a"+str(i), F.lit("a"))
> # create data frame with id to join
> df2=spark.range(num_rows, numPartitions=100)
> # write and read to start "fresh"
> df1.write.parquet("delme_1.parquet", mode="overwrite")
> df2.write.parquet("delme_2.parquet", mode="overwrite")
> df1=spark.read.parquet("delme_1.parquet");
> df2=spark.read.parquet("delme_2.parquet");
> df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
> df4=df3.filter("id<10")
> print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result); it varies between 
> runs. The last column may show different values, but 12 and 16 are common. Sometimes 
> a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node 
> YARN cluster.
> I am happy to provide more information if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem: `toPandas` and `collect` show 
> the same behaviour, so I basically can hardly get at the data at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21851) Spark 2.0 data corruption with cache and 200 columns

2017-08-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144361#comment-16144361
 ] 

Dongjoon Hyun commented on SPARK-21851:
---

Hi, [~Antsu].
This is fixed in 2.0.1, too. Why don't you upgrade your system?

> Spark 2.0 data corruption with cache and 200 columns
> 
>
> Key: SPARK-21851
> URL: https://issues.apache.org/jira/browse/SPARK-21851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Anton Suchaneck
>
> Doing a join and cache can corrupt data as shown here:
> {code}
> import pyspark.sql.functions as F
> num_rows=200
> for num_cols in range(198, 205):
> # create data frame with id and some dummy cols
> df1=spark.range(num_rows, numPartitions=100)
> for i in range(num_cols-1):
> df1=df1.withColumn("a"+str(i), F.lit("a"))
> # create data frame with id to join
> df2=spark.range(num_rows, numPartitions=100)
> # write and read to start "fresh"
> df1.write.parquet("delme_1.parquet", mode="overwrite")
> df2.write.parquet("delme_2.parquet", mode="overwrite")
> df1=spark.read.parquet("delme_1.parquet");
> df2=spark.read.parquet("delme_2.parquet");
> df3=df1.join(df2, "id", how="left").cache()   # this cache seems to make a difference
> df4=df3.filter("id<10")
> print(len(df4.columns), df4.count(), df4.cache().count())   # second cache gives different result
> {code}
> Output:
> {noformat}
> 198 10 10
> 199 10 10
> 200 10 10
> 201 12 12
> 202 12 12
> 203 16 16
> 204 10 12
> {noformat}
> Occasionally the middle number is also 10 (the expected result); it varies between 
> runs. The last column may show different values, but 12 and 16 are common. Sometimes 
> a slightly higher num_rows is needed to trigger this behaviour.
> The Spark version is 2.0.0.2.5.0.0-1245 on a Redhat system on a multi-node 
> YARN cluster.
> I am happy to provide more information if you let me know what is 
> interesting.
> It's not strictly `cache` that is the problem: `toPandas` and `collect` show 
> the same behaviour, so I basically can hardly get at the data at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: (was: Apache Spark)

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: Apache Spark

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: Apache Spark

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: (was: Apache Spark)

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: Apache Spark

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21765:


Assignee: (was: Apache Spark)

> Ensure all leaf nodes that are derived from streaming sources have 
> isStreaming=true
> ---
>
> Key: SPARK-21765
> URL: https://issues.apache.org/jira/browse/SPARK-21765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
> streaming sources don't set the bit, and the bit can sometimes be lost in 
> rewriting. Setting the bit for all plans that are logically streaming will 
> help us simplify the logic around checking query plan validity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: (was: Apache Spark)

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.
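> A sketch of the proposed usage, mirroring the existing Parquet option (the codec 
> name and output path below are assumptions for illustration):
> {code}
> spark.conf.set("spark.sql.orc.compression.codec", "zlib")
> spark.range(10).write.mode("overwrite").orc("/tmp/orc_compressed")
> {code}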



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: Apache Spark

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: (was: Apache Spark)

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: Apache Spark

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: Apache Spark

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: (was: Apache Spark)

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: (was: Apache Spark)

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20990) Multi-line support for JSON

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20990:


Assignee: Apache Spark

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-08-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17139.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 15435
[https://github.com/apache/spark/pull/15435]

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add a model summary to multinomial logistic regression, using the same interface 
> as in other ML models.
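> A hedged sketch of the intended usage, following the existing summary API (a 
> `training` DataFrame with label/features columns is assumed; member names are a 
> plausible shape rather than a confirmed final API):
> {code}
> import org.apache.spark.ml.classification.LogisticRegression
>
> val model = new LogisticRegression().setFamily("multinomial").fit(training)
> val summary = model.summary        // training summary added by this change
> println(summary.accuracy)          // e.g. overall accuracy of the fitted model
> {code}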



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21764:


Assignee: (was: Apache Spark)

> Tests failures on Windows: resources not being closed and incorrect paths
> -
>
> Key: SPARK-21764
> URL: https://issues.apache.org/jira/browse/SPARK-21764
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922, 
> but I decided to open another one here, targeting 2.3.0 as the fix version.
> In short, there are many test failures on Windows, mainly due to resources 
> being removed without first being closed (which fails on Windows) and 
> incorrect path inputs.
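> For illustration only, the typical pattern behind such fixes (file names are 
> hypothetical): close resources before deleting them, and build paths in a 
> separator-independent way.
> {code}
> import java.io.{File, FileOutputStream}
> import java.nio.file.Paths
>
> val dir = new File(System.getProperty("java.io.tmpdir"), "spark-test")
> dir.mkdirs()
> val out = new FileOutputStream(new File(dir, "data.bin"))
> try out.write(Array[Byte](1, 2, 3)) finally out.close()  // unclosed handles block deletion on Windows
> Paths.get(dir.getPath, "data.bin")                        // instead of dir.getPath + "/data.bin"
> dir.listFiles().foreach(_.delete())
> dir.delete()
> {code}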



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21764:


Assignee: Apache Spark

> Tests failures on Windows: resources not being closed and incorrect paths
> -
>
> Key: SPARK-21764
> URL: https://issues.apache.org/jira/browse/SPARK-21764
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922, 
> but I decided to open another one here, targeting 2.3.0 as the fix version.
> In short, there are many test failures on Windows, mainly due to resources 
> being removed without first being closed (which fails on Windows) and 
> incorrect path inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-08-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17139:
-

Assignee: Weichen Xu

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>
> Add a model summary to multinomial logistic regression, using the same interface 
> as in other ML models.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: (was: Apache Spark)

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: Apache Spark

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: Apache Spark

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: (was: Apache Spark)

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: Apache Spark

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: (was: Apache Spark)

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21764:


Assignee: Apache Spark

> Tests failures on Windows: resources not being closed and incorrect paths
> -
>
> Key: SPARK-21764
> URL: https://issues.apache.org/jira/browse/SPARK-21764
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922, 
> but I decided to open another one here, targeting 2.3.0 as the fix version.
> In short, there are many test failures on Windows, mainly due to resources 
> being removed without first being closed (which fails on Windows) and 
> incorrect path inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21764:


Assignee: (was: Apache Spark)

> Tests failures on Windows: resources not being closed and incorrect paths
> -
>
> Key: SPARK-21764
> URL: https://issues.apache.org/jira/browse/SPARK-21764
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922, 
> but I decided to open another one here, targeting 2.3.0 as the fix version.
> In short, there are many test failures on Windows, mainly due to resources 
> being removed without first being closed (which fails on Windows) and 
> incorrect path inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: Apache Spark

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21848) Create trait to identify user-defined functions

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21848:


Assignee: (was: Apache Spark)

> Create trait to identify user-defined functions
> ---
>
> Key: SPARK-21848
> URL: https://issues.apache.org/jira/browse/SPARK-21848
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Create a trait to make it easier to identify which expressions are 
> user-defined functions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: Apache Spark

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: (was: Apache Spark)

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: Apache Spark

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: (was: Apache Spark)

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: (was: Apache Spark)

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21839:


Assignee: Apache Spark

> Support SQL config for ORC compression 
> ---
>
> Key: SPARK-21839
> URL: https://issues.apache.org/jira/browse/SPARK-21839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to provide `spark.sql.orc.compression.codec` like 
> `spark.sql.parquet.compression.codec`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21829) Enable config to permanently blacklist a list of nodes

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21829:


Assignee: Apache Spark

> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.
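> For illustration, how the proposed parameter could be set for a specific job 
> (node names are hypothetical):
> {code}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .set("spark.blacklist.enabled", "true")  // the general blacklist mechanism may also need to be on
>   .set("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
> {code}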



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21829) Enable config to permanently blacklist a list of nodes

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21829:


Assignee: (was: Apache Spark)

> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21513) SQL to_json should support all column types

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21513:


Assignee: Apache Spark

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>Assignee: Apache Spark
>  Labels: Starter
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].
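> A short sketch of the current behaviour and the struct workaround (the column 
> contents are made up for illustration):
> {code}
> import org.apache.spark.sql.functions.{col, lit, map, struct, to_json}
>
> val df = spark.range(1).select(map(lit("k"), lit("v")).as("tags"))
> // df.select(to_json(col("tags")))                     // currently fails with the error above
> df.select(to_json(struct(col("tags")))).show(false)    // workaround: wrap the map in a struct first
> {code}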



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21513) SQL to_json should support all column types

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21513:


Assignee: (was: Apache Spark)

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>  Labels: Starter
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21835:


Assignee: Apache Spark

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries into join operations. 
> During the structural integrity check, I found that {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.
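A minimal query of the kind that exercises this rewrite, as a sketch for illustration (not taken from the report; the table and column names are hypothetical, assuming a SparkSession {{spark}} is in scope):

{code}
import spark.implicits._

// RewritePredicateSubquery turns predicate subqueries (IN / EXISTS) in a
// Filter into left semi / anti joins. A subquery over the same relation as
// the outer query is the kind of shape where conflicting attribute IDs can
// show up in the rewritten plan.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.createOrReplaceTempView("t")

spark.sql(
  """
    |SELECT * FROM t outer_t
    |WHERE outer_t.id IN (SELECT id FROM t WHERE name = 'a')
  """.stripMargin).explain(true)
{code}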



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21835:


Assignee: (was: Apache Spark)

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries into join operations. 
> During the structural integrity check, I found that {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21834:


Assignee: Apache Spark

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> The killExecutor API currently does not allow killing an executor without 
> also updating the total number of executors needed. When dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635),
> which is incorrect because the allocator already takes care of setting the 
> required number of executors itself.
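For context, a minimal sketch of the call path described above, using the developer API on SparkContext (the executor ID is a placeholder, and {{sc}} is assumed to be an active SparkContext with dynamic allocation enabled):

{code}
// Relevant configuration for dynamic allocation:
//   spark.dynamicAllocation.enabled=true
//   spark.shuffle.service.enabled=true

// Developer API: ask the scheduler backend to kill a specific executor.
// Per this issue, the current implementation also lowers the total number
// of executors requested, even though the allocator already maintains that
// target itself.
sc.killExecutors(Seq("3"))  // "3" is a placeholder executor ID
{code}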



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21834:


Assignee: (was: Apache Spark)

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> The killExecutor API currently does not allow killing an executor without 
> also updating the total number of executors needed. When dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635),
> which is incorrect because the allocator already takes care of setting the 
> required number of executors itself.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19662) Add Fair Scheduler Unit Test coverage for different build cases

2017-08-28 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-19662:


Assignee: Eren Avsarogullari

> Add Fair Scheduler Unit Test coverage for different build cases
> ---
>
> Key: SPARK-19662
> URL: https://issues.apache.org/jira/browse/SPARK-19662
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.3.0
>
>
> The Fair Scheduler can be built via one of the following options:
> - By setting the {{spark.scheduler.allocation.file}} property 
> - By placing {{fairscheduler.xml}} on the classpath 
> These options are checked in order, and the fair scheduler is built from the 
> first one found. If an invalid path is given, a {{FileNotFoundException}} is 
> thrown.
> The related PR aims to add unit test coverage for these use cases, and a minor 
> documentation change has been added for the second option ({{fairscheduler.xml}} 
> on the classpath) to inform the user.
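For reference, a minimal sketch of the first option, pointing {{spark.scheduler.allocation.file}} at a pool definition file (the path, application name, and pool contents below are placeholders):

{code}
import org.apache.spark.sql.SparkSession

// The referenced fairscheduler.xml would contain pool definitions such as:
//   <allocations>
//     <pool name="production">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>1</weight>
//       <minShare>2</minShare>
//     </pool>
//   </allocations>
val spark = SparkSession.builder()
  .appName("fair-scheduler-example")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()
{code}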



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


