[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager

2018-11-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682259#comment-16682259
 ] 

Yuming Wang commented on SPARK-26000:
-

It is not a Spark issue. Maybe you need to increase 
{{dfs.datanode.handler.count}}.
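
(For context: {{dfs.datanode.handler.count}} is an HDFS DataNode setting that lives in 
hdfs-site.xml on the DataNodes, not a Spark configuration. The spark-shell sketch below 
only inspects what the client-side Hadoop configuration carries, if anything; it is an 
illustration, not a fix.)

{code:java}
// Check whether the client-side Hadoop configuration overrides the DataNode
// handler count; a null result means the cluster-side value/default applies.
val handlerCount = spark.sparkContext.hadoopConfiguration.get("dfs.datanode.handler.count")
println(handlerCount)
{code}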

> Missing block when reading HDFS Data from Cloudera Manager
> --
>
> Key: SPARK-26000
> URL: https://issues.apache.org/jira/browse/SPARK-26000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: john
>Priority: Major
>
> I am able to write to Cloudera Manager HDFS through open-source Spark, which 
> runs separately, but I am not able to read the Cloudera Manager HDFS data back.
>  
> I am getting missing block location and socket timeout errors.
>  
> spark.read().textfile(path_to_file)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-26001:

Affects Version/s: (was: 2.5.0)
   3.0.0

> Reduce memory copy when writing decimal
> ---
>
> Key: SPARK-26001
> URL: https://issues.apache.org/jira/browse/SPARK-26001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: caoxuewen
>Priority: Major
>
> This PR fixes two things:
> - When writing non-null decimals, we do not need to zero out all 16 allocated 
> bytes: if the number of bytes needed for the decimal is greater than 8, there is 
> no need to zero out bytes 0 through 7, because the first 8 bytes are overwritten 
> when the decimal is written.
> - When writing null decimals, we do not need to zero out all 16 allocated bytes 
> either: BitSetMethods.set marks the field as null and the decimal length is set 
> to 0, so the 16-byte region is never read when the decimal is retrieved. This 
> makes skipping the zero-out safe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25102.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes the Spark version number into the Hive table properties 
> under `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write the Spark version to ORC/Parquet file metadata under 
> `org.apache.spark.sql.create.version`. This key is different from the Hive table 
> property key `spark.sql.create.version`; it seems that we cannot change that one, 
> for backward compatibility (even in Apache Spark 3.0).
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
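
A hedged Scala sketch of how the new footer key could be checked programmatically, 
assuming parquet-mr's {{ParquetFileReader}} API and a placeholder file path:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path to a Parquet file written by Spark.
val inputFile = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00000.snappy.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  // The "extra" entries shown by parquet-tools live in the footer's key/value metadata.
  val keyValueMeta = reader.getFooter.getFileMetaData.getKeyValueMetaData
  println(keyValueMeta.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
{code}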



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26001:


Assignee: (was: Apache Spark)

> Reduce memory copy when writing decimal
> ---
>
> Key: SPARK-26001
> URL: https://issues.apache.org/jira/browse/SPARK-26001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> This PR fixes two things:
> - When writing non-null decimals, we do not need to zero out all 16 allocated 
> bytes: if the number of bytes needed for the decimal is greater than 8, there is 
> no need to zero out bytes 0 through 7, because the first 8 bytes are overwritten 
> when the decimal is written.
> - When writing null decimals, we do not need to zero out all 16 allocated bytes 
> either: BitSetMethods.set marks the field as null and the decimal length is set 
> to 0, so the 16-byte region is never read when the decimal is retrieved. This 
> makes skipping the zero-out safe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682238#comment-16682238
 ] 

Apache Spark commented on SPARK-26001:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22998

> Reduce memory copy when writing decimal
> ---
>
> Key: SPARK-26001
> URL: https://issues.apache.org/jira/browse/SPARK-26001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> This PR fixes two things:
> - When writing non-null decimals, we do not need to zero out all 16 allocated 
> bytes: if the number of bytes needed for the decimal is greater than 8, there is 
> no need to zero out bytes 0 through 7, because the first 8 bytes are overwritten 
> when the decimal is written.
> - When writing null decimals, we do not need to zero out all 16 allocated bytes 
> either: BitSetMethods.set marks the field as null and the decimal length is set 
> to 0, so the 16-byte region is never read when the decimal is retrieved. This 
> makes skipping the zero-out safe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26001:


Assignee: Apache Spark

> Reduce memory copy when writing decimal
> ---
>
> Key: SPARK-26001
> URL: https://issues.apache.org/jira/browse/SPARK-26001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> This PR fixes two things:
> - When writing non-null decimals, we do not need to zero out all 16 allocated 
> bytes: if the number of bytes needed for the decimal is greater than 8, there is 
> no need to zero out bytes 0 through 7, because the first 8 bytes are overwritten 
> when the decimal is written.
> - When writing null decimals, we do not need to zero out all 16 allocated bytes 
> either: BitSetMethods.set marks the field as null and the decimal length is set 
> to 0, so the 16-byte region is never read when the decimal is retrieved. This 
> makes skipping the zero-out safe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26001:
-

 Summary: Reduce memory copy when writing decimal
 Key: SPARK-26001
 URL: https://issues.apache.org/jira/browse/SPARK-26001
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


This PR fixes two things:
- When writing non-null decimals, we do not need to zero out all 16 allocated 
bytes: if the number of bytes needed for the decimal is greater than 8, there is 
no need to zero out bytes 0 through 7, because the first 8 bytes are overwritten 
when the decimal is written.
- When writing null decimals, we do not need to zero out all 16 allocated bytes 
either: BitSetMethods.set marks the field as null and the decimal length is set 
to 0, so the 16-byte region is never read when the decimal is retrieved. This 
makes skipping the zero-out safe.
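
A minimal Scala sketch of the idea, using a plain byte array to stand in for the 
16-byte decimal slot (this is an illustration only, not Spark's actual UnsafeRowWriter 
code; the {{slot}} array and helper name are made up):

{code:java}
// One fixed-width slot of 16 bytes, as used for decimals wider than 8 bytes.
val slot = new Array[Byte](16)

def writeLargeDecimal(unscaled: Array[Byte]): Unit = {
  require(unscaled.length > 8 && unscaled.length <= 16)
  // Old behaviour: zero out the whole 16-byte slot before writing.
  // Proposed behaviour: skip zeroing bytes 0..7, because the value itself always
  // overwrites at least the first 8 bytes; only bytes 8..15 still need zeroing.
  java.util.Arrays.fill(slot, 8, 16, 0.toByte)
  System.arraycopy(unscaled, 0, slot, 0, unscaled.length)
}

// For a null decimal the proposal is to skip the zero-out entirely: the null bit is
// set (cf. BitSetMethods.set) and the recorded length is 0, so readers never touch
// the 16-byte slot.
{code}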



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager

2018-11-09 Thread john (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682163#comment-16682163
 ] 

john commented on SPARK-26000:
--

I have Cloudera Manager in Environment A, which has the HDFS component, and 
Spark in Environment B. I am doing a very simple read and write to/from HDFS. 
Writing to the Cloudera Manager HDFS works as expected, but when reading back I 
get the issues below:

 

"java.lang.reflect.InvocationTargetException" Caused By: 
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;"

Caused By: "java.net.SocketTimeoutException: 6 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/SparkNode_IP_PORT_NoO 
remote=/NameNode:50010:"

Java sample code:

 

// writing 

spark.write().mode("append").format("parquet").save(path_to_file);

// read

spark.read().parquet(path_to_file);

 

 

 

 

> Missing block when reading HDFS Data from Cloudera Manager
> --
>
> Key: SPARK-26000
> URL: https://issues.apache.org/jira/browse/SPARK-26000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: john
>Priority: Major
>
> I am able to write to Cloudera Manager HDFS through open-source Spark, which 
> runs separately, but I am not able to read the Cloudera Manager HDFS data back.
>  
> I am getting missing block location and socket timeout errors.
>  
> spark.read().textfile(path_to_file)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager

2018-11-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682151#comment-16682151
 ] 

Yuming Wang commented on SPARK-26000:
-

Could you provide more information?

> Missing block when reading HDFS Data from Cloudera Manager
> --
>
> Key: SPARK-26000
> URL: https://issues.apache.org/jira/browse/SPARK-26000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: john
>Priority: Major
>
> I am able to write to Cloudera Manager HDFS through open-source Spark, which 
> runs separately, but I am not able to read the Cloudera Manager HDFS data back.
>  
> I am getting missing block location and socket timeout errors.
>  
> spark.read().textfile(path_to_file)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager

2018-11-09 Thread john (JIRA)
john created SPARK-26000:


 Summary: Missing block when reading HDFS Data from Cloudera Manager
 Key: SPARK-26000
 URL: https://issues.apache.org/jira/browse/SPARK-26000
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.2
Reporter: john


I am able to write to Cloudera Manager HDFS through open-source Spark, which 
runs separately, but I am not able to read the Cloudera Manager HDFS data back.

 

I am getting missing block location and socket timeout errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682141#comment-16682141
 ] 

kevin yu commented on SPARK-25993:
--

I am looking into it now. Kevin

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in the 
> 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682137#comment-16682137
 ] 

Dongjoon Hyun commented on SPARK-24229:
---

I also agree with [~vanzin]'s opinion.

Since this is open, [~Fokko] already made a PR for this.

I'll close this for now. Please reopen this with a reproducible test case.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26000) Missing block when reading HDFS Data from Cloudera Manager

2018-11-09 Thread john (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

john updated SPARK-26000:
-
Description: 
I am able to write to Cloudera Manager HDFS through open-source Spark, which 
runs separately, but I am not able to read the Cloudera Manager HDFS data back.

 

I am getting missing block location and socket timeout errors.

 

spark.read().textfile(path_to_file)

  was:
I am able to write to Cloudera Manager HDFS through open-source Spark, which 
runs separately, but I am not able to read the Cloudera Manager HDFS data back.

 

I am getting missing block location and socket timeout errors.


> Missing block when reading HDFS Data from Cloudera Manager
> --
>
> Key: SPARK-26000
> URL: https://issues.apache.org/jira/browse/SPARK-26000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: john
>Priority: Major
>
> I am able to write to Cloudera Manager HDFS through open-source Spark, which 
> runs separately, but I am not able to read the Cloudera Manager HDFS data back.
>  
> I am getting missing block location and socket timeout errors.
>  
> spark.read().textfile(path_to_file)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682137#comment-16682137
 ] 

Dongjoon Hyun edited comment on SPARK-24229 at 11/10/18 1:40 AM:
-

I also agree with [~vanzin]'s opinion.

Since this is open, [~Fokko] already made a PR for this.

I'll close this for now to save the Apache Spark community effort. Please reopen 
this with a reproducible test case.


was (Author: dongjoon):
I also agree with [~vanzin]'s opinion.

Since this is open, [~Fokko] already made a PR for this.

I'll close this for now. Please reopen this with a reproducible test case.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24229.
---
Resolution: Not A Problem

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25999:


Assignee: (was: Apache Spark)

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682129#comment-16682129
 ] 

Apache Spark commented on SPARK-25999:
--

User 'shanyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22997

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682126#comment-16682126
 ] 

Apache Spark commented on SPARK-25999:
--

User 'shanyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22997

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25999:


Assignee: Apache Spark

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682122#comment-16682122
 ] 

Yuming Wang commented on SPARK-25999:
-

Please create a pull request at: https://github.com/apache/spark/pulls


> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682089#comment-16682089
 ] 

shanyu zhao commented on SPARK-25999:
-

Patch attached. Basically it creates an optional project that brings all 
dependencies into the R/rjarsdep/target folder, and copies the missing jars to 
the assembly/target folder before building R.

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-25999:

Attachment: SPARK-25999.patch

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-25999:

Summary: make-distribution.sh failure with --r and -Phadoop-provided  (was: 
Spark make-distribution failure with --r and -Phadoop-provided)

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
>
> It is not possible to build a distribution that does not contain the Hadoop 
> dependencies but does include SparkR. This is because R/check_cran.sh builds the 
> R documentation, which depends on the Hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25999) Spark make-distribution failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-25999:
---

 Summary: Spark make-distribution failure with --r and 
-Phadoop-provided
 Key: SPARK-25999
 URL: https://issues.apache.org/jira/browse/SPARK-25999
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0, 2.3.2
Reporter: shanyu zhao


It is not possible to build a distribution that does not contain the Hadoop 
dependencies but does include SparkR. This is because R/check_cran.sh builds the R 
documentation, which depends on the Hadoop dependencies in the 
assembly/target/scala-xxx/jars folder.

To reproduce:

MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive -Psparkr 
-Phadoop-provided"

./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS

 

Error:
* creating vignettes ... ERROR
...
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682023#comment-16682023
 ] 

Apache Spark commented on SPARK-25997:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22996

> Python example code for Power Iteration Clustering in spark.ml
> --
>
> Key: SPARK-25997
> URL: https://issues.apache.org/jira/browse/SPARK-25997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add a Python example for Power Iteration Clustering to the spark.ml examples.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25997:


Assignee: Apache Spark

> Python example code for Power Iteration Clustering in spark.ml
> --
>
> Key: SPARK-25997
> URL: https://issues.apache.org/jira/browse/SPARK-25997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Add a Python example for Power Iteration Clustering to the spark.ml examples.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25997:


Assignee: (was: Apache Spark)

> Python example code for Power Iteration Clustering in spark.ml
> --
>
> Key: SPARK-25997
> URL: https://issues.apache.org/jira/browse/SPARK-25997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add a Python example for Power Iteration Clustering to the spark.ml examples.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25998:


Assignee: Apache Spark

> TorrentBroadcast holds strong reference to broadcast object
> ---
>
> Key: SPARK-25998
> URL: https://issues.apache.org/jira/browse/SPARK-25998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Brandon Krieger
>Assignee: Apache Spark
>Priority: Major
>
> If we do a large number of broadcast joins while holding onto the Dataset 
> reference, it will hold onto a large amount of memory for the value of the 
> broadcast object. The broadcast object is also held in the MemoryStore, but 
> that will clean itself up to prevent its memory usage from going over a 
> certain level. In my use case, I don't want to release the reference to the 
> Dataset (which would allow the broadcast object to be GCed) because I want to 
> be able to unpersist it at some point in the future (when it is no longer 
> relevant).
> See the following repro in Spark shell:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.SparkEnv
> val startDf = (1 to 100).toDF("num").withColumn("num", 
> $"num".cast("string")).cache()
> val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
> val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
> val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
> leftDf.col("num").eqNullSafe(rightDf.col("num")))
> broadcastJoinedDf.count
> // Take a heap dump, see UnsafeHashedRelation with hard references in 
> MemoryStore and Dataset
> // Force the MemoryStore to clear itself
> SparkEnv.get.blockManager.stop
> // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
> referenced only by the Dataset.
> {code}
> If we make the TorrentBroadcast hold a weak reference to the broadcast 
> object, the second heap dump will show nothing; the UnsafeHashedRelation has 
> been GCed.
> Given that the broadcast object can be reloaded from the MemoryStore, it 
> seems like it would be alright to use a WeakReference instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682017#comment-16682017
 ] 

Apache Spark commented on SPARK-25998:
--

User 'bkrieger' has created a pull request for this issue:
https://github.com/apache/spark/pull/22995

> TorrentBroadcast holds strong reference to broadcast object
> ---
>
> Key: SPARK-25998
> URL: https://issues.apache.org/jira/browse/SPARK-25998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Brandon Krieger
>Priority: Major
>
> If we do a large number of broadcast joins while holding onto the Dataset 
> reference, it will hold onto a large amount of memory for the value of the 
> broadcast object. The broadcast object is also held in the MemoryStore, but 
> that will clean itself up to prevent its memory usage from going over a 
> certain level. In my use case, I don't want to release the reference to the 
> Dataset (which would allow the broadcast object to be GCed) because I want to 
> be able to unpersist it at some point in the future (when it is no longer 
> relevant).
> See the following repro in Spark shell:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.SparkEnv
> val startDf = (1 to 100).toDF("num").withColumn("num", 
> $"num".cast("string")).cache()
> val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
> val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
> val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
> leftDf.col("num").eqNullSafe(rightDf.col("num")))
> broadcastJoinedDf.count
> // Take a heap dump, see UnsafeHashedRelation with hard references in 
> MemoryStore and Dataset
> // Force the MemoryStore to clear itself
> SparkEnv.get.blockManager.stop
> // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
> referenced only by the Dataset.
> {code}
> If we make the TorrentBroadcast hold a weak reference to the broadcast 
> object, the second heap dump will show nothing; the UnsafeHashedRelation has 
> been GCed.
> Given that the broadcast object can be reloaded from the MemoryStore, it 
> seems like it would be alright to use a WeakReference instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25998:


Assignee: (was: Apache Spark)

> TorrentBroadcast holds strong reference to broadcast object
> ---
>
> Key: SPARK-25998
> URL: https://issues.apache.org/jira/browse/SPARK-25998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Brandon Krieger
>Priority: Major
>
> If we do a large number of broadcast joins while holding onto the Dataset 
> reference, it will hold onto a large amount of memory for the value of the 
> broadcast object. The broadcast object is also held in the MemoryStore, but 
> that will clean itself up to prevent its memory usage from going over a 
> certain level. In my use case, I don't want to release the reference to the 
> Dataset (which would allow the broadcast object to be GCed) because I want to 
> be able to unpersist it at some point in the future (when it is no longer 
> relevant).
> See the following repro in Spark shell:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.SparkEnv
> val startDf = (1 to 100).toDF("num").withColumn("num", 
> $"num".cast("string")).cache()
> val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
> val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
> val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
> leftDf.col("num").eqNullSafe(rightDf.col("num")))
> broadcastJoinedDf.count
> // Take a heap dump, see UnsafeHashedRelation with hard references in 
> MemoryStore and Dataset
> // Force the MemoryStore to clear itself
> SparkEnv.get.blockManager.stop
> // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
> referenced only by the Dataset.
> {code}
> If we make the TorrentBroadcast hold a weak reference to the broadcast 
> object, the second heap dump will show nothing; the UnsafeHashedRelation has 
> been GCed.
> Given that the broadcast object can be reloaded from the MemoryStore, it 
> seems like it would be alright to use a WeakReference instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-09 Thread Brandon Krieger (JIRA)
Brandon Krieger created SPARK-25998:
---

 Summary: TorrentBroadcast holds strong reference to broadcast 
object
 Key: SPARK-25998
 URL: https://issues.apache.org/jira/browse/SPARK-25998
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Brandon Krieger


If we do a large number of broadcast joins while holding onto the Dataset 
reference, it will hold onto a large amount of memory for the value of the 
broadcast object. The broadcast object is also held in the MemoryStore, but 
that will clean itself up to prevent its memory usage from going over a certain 
level. In my use case, I don't want to release the reference to the Dataset 
(which would allow the broadcast object to be GCed) because I want to be able 
to unpersist it at some point in the future (when it is no longer relevant).

See the following repro in Spark shell:

{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.SparkEnv

val startDf = (1 to 100).toDF("num").withColumn("num", 
$"num".cast("string")).cache()
val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
leftDf.col("num").eqNullSafe(rightDf.col("num")))
broadcastJoinedDf.count

// Take a heap dump, see UnsafeHashedRelation with hard references in 
MemoryStore and Dataset

// Force the MemoryStore to clear itself
SparkEnv.get.blockManager.stop

// Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
referenced only by the Dataset.
{code}

If we make the TorrentBroadcast hold a weak reference to the broadcast object, 
the second heap dump will show nothing; the UnsafeHashedRelation has been GCed.

Given that the broadcast object can be reloaded from the MemoryStore, it seems 
like it would be alright to use a WeakReference instead.
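
A minimal Scala sketch of the proposed direction (an illustration only, not Spark's 
actual TorrentBroadcast code; the class and parameter names are made up): cache the 
deserialized value behind a {{WeakReference}} and rebuild it via a reload function, 
e.g. one that fetches the blocks back from the block manager, when the GC has cleared 
it.

{code:java}
import java.lang.ref.WeakReference

class WeaklyCachedValue[T <: AnyRef](reload: () => T) {
  @volatile private var ref = new WeakReference[T](reload())

  def get: T = {
    val cached = ref.get()
    if (cached != null) {
      cached
    } else synchronized {
      // Re-check under the lock, then rebuild and re-cache the value.
      val again = ref.get()
      if (again != null) again
      else {
        val fresh = reload()   // e.g. re-read the broadcast blocks from storage
        ref = new WeakReference[T](fresh)
        fresh
      }
    }
  }
}
{code}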




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2018-11-09 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-25997:
--

 Summary: Python example code for Power Iteration Clustering in 
spark.ml
 Key: SPARK-25997
 URL: https://issues.apache.org/jira/browse/SPARK-25997
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Add a Python example for Power Iteration Clustering to the spark.ml examples.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-11-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24101:
-

Assignee: Ilya Matiach

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Assignee: Ilya Matiach
>Priority: Major
> Fix For: 3.0.0
>
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
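
A short hedged Scala sketch of the gap, assuming a training DataFrame with 
{{features}}, {{label}} and a per-row {{weight}} column (the column names are 
illustrative):

{code:java}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Training already honours per-row weights...
val lr = new LogisticRegression()
  .setWeightCol("weight")

// ...but the evaluator computes its metric over unweighted rows, which skews
// CrossValidator's model selection; the proposal is for the evaluator to accept
// the same weight column when computing the metric.
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
{code}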



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-11-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24101.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 17086
[https://github.com/apache/spark/pull/17086]

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Assignee: Ilya Matiach
>Priority: Major
> Fix For: 3.0.0
>
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps

2018-11-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Gómez updated SPARK-25996:
--
Description: 
Hi all,

When using PySpark I perform a count of the records preceding the current row's 
timestamp, including the current row in the count, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)

and it returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 00:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

 

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over to the 
following rows.
This happens with the other aggregation operations as well.

Even if the first rows share the same timestamp, that does not mean they do not 
both exist; it should not be treated as if the only row that exists is the last 
one with that timestamp.

 

Could you help me?

Thank you

  was:
Hi all,

When using PySpark I perform a count of the records preceding the current row's 
timestamp, including the current row in the count, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)

and it returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 00:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over to the 
following rows.
 This happens with the other aggregation operations as well.

Even if the first rows share the same timestamp, that does not mean they do not 
both exist; it should not be treated as if the only row that exists is the last 
one with that timestamp.

 

Could you help me?

Thank you


> Aggregations do not return the correct values for rows with equal timestamps
> --
>
> Key: SPARK-25996
> URL: https://issues.apache.org/jira/browse/SPARK-25996
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1, 2.4.0
> Environment: Windows 10
> PyCharm 2018.2.2
> Python 3.6
>  
>Reporter: Ignacio Gómez
>Priority: Major
>
> Hi all,
> When using PySpark I perform a count of the records preceding the current row's 
> timestamp, including the current row in the count, with the corresponding query:
> query = """
>  select *, count(*) over (partition by ACCOUNTID
>  order by TS
>  range between interval 5000 milliseconds preceding and current row) as 
> total_count
>  from df3
>  """
>  df3 = sqlContext.sql(query)
> and it returns the following:
>  
> |ACCOUNTID|AMOUNT|TS|total_count|
> |1|100|2018-01-01 00:00:01|1|
> |1|1000|2018-01-01 10:00:01|1|
> |1|25|2018-01-01 10:00:02|2|
> |1|500|2018-01-01 10:00:03|3|
> |1|100|2018-01-01 10:00:04|4|
> |1|80|2018-01-01 10:00:05|5|
> |1|700|2018-01-01 11:00:04|1|
> |1|205|2018-01-02 10:00:02|1|
> |1|500|2018-01-02 10:00:03|2|
> |3|80|2018-01-02 10:00:05|1|
>  
> As you can see, in the third row the total_count should be 3 instead of 2, 
> because there are 2 previous records and not 1. The error carries over to the 
> following rows.
> This happens with the other aggregation operations as well.
> Even if the first rows share the same timestamp, that does not mean they do not 
> both exist; it should not be treated as if the only row that exists is the last 
> one with that timestamp.
>  
> Could you help me?
> Thank you
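
For reference, a minimal Scala analogue of the reported PySpark setup (an 
illustrative sketch only; the rows come from the table above):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_timestamp

val spark = SparkSession.builder().master("local[*]").appName("SPARK-25996").getOrCreate()
import spark.implicits._

// Rebuild a df3 view from (a subset of) the rows shown above.
val df3 = Seq(
  ("1", 100, "2018-01-01 00:00:01"),
  ("1", 1000, "2018-01-01 10:00:01"),
  ("1", 25, "2018-01-01 10:00:02"),
  ("1", 500, "2018-01-01 10:00:03")
).toDF("ACCOUNTID", "AMOUNT", "TS")
  .withColumn("TS", to_timestamp($"TS"))
df3.createOrReplaceTempView("df3")

// Same windowed count as the reported query.
spark.sql("""
  select *, count(*) over (partition by ACCOUNTID
    order by TS
    range between interval 5000 milliseconds preceding and current row) as total_count
  from df3
""").show(false)
{code}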



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps

2018-11-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Gómez updated SPARK-25996:
--
Description: 
Hi all,

When using PySpark I perform a count of the records preceding the current row's 
timestamp, including the current row in the count, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)

and it returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 00:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over to the 
following rows.
 This happens with the other aggregation operations as well.

Even if the first rows share the same timestamp, that does not mean they do not 
both exist; it should not be treated as if the only row that exists is the last 
one with that timestamp.

 

Could you help me?

Thank you

  was:
Hi all,

When using PySpark I perform a count of the records preceding the current row's 
timestamp, including the current row in the count, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as 
total_count
 from df3
 """
 df3 = sqlContext.sql(query)

and it returns the following:

+---------+------+-------------------+-----------+
|ACCOUNTID|AMOUNT|                 TS|total_count|
+---------+------+-------------------+-----------+
|        1|   100|2018-01-01 00:00:01|          1|
|        1|  1000|2018-01-01 10:00:01|          1|
|        1|    25|2018-01-01 10:00:02|          2|
|        1|   500|2018-01-01 10:00:03|          3|
|        1|   100|2018-01-01 10:00:04|          4|
|        1|    80|2018-01-01 10:00:05|          5|
|        1|   700|2018-01-01 11:00:04|          1|
|        1|   205|2018-01-02 10:00:02|          1|
|        1|   500|2018-01-02 10:00:03|          2|
|        3|    80|2018-01-02 10:00:05|          1|
+---------+------+-------------------+-----------+

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over to the 
following rows.
 This happens with the other aggregation operations as well.

Even if the first rows share the same timestamp, that does not mean they do not 
both exist; it should not be treated as if the only row that exists is the last 
one with that timestamp.

 

Could you help me?

Thank you


> Aggregations do not return the correct values for rows with equal timestamps
> --
>
> Key: SPARK-25996
> URL: https://issues.apache.org/jira/browse/SPARK-25996
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1, 2.4.0
> Environment: Windows 10
> PyCharm 2018.2.2
> Python 3.6
>  
>Reporter: Ignacio Gómez
>Priority: Major
>
> Hi all,
> When using PySpark I perform a count of the records preceding the current row's 
> timestamp, including the current row in the count, with the corresponding query:
> query = """
>  select *, count(*) over (partition by ACCOUNTID
>  order by TS
>  range between interval 5000 milliseconds preceding and current row) as 
> total_count
>  from df3
>  """
>  df3 = sqlContext.sql(query)
> and it returns the following:
>  
> |ACCOUNTID|AMOUNT|TS|total_count|
> |1|100|2018-01-01 00:00:01|1|
> |1|1000|2018-01-01 10:00:01|1|
> |1|25|2018-01-01 10:00:02|2|
> |1|500|2018-01-01 10:00:03|3|
> |1|100|2018-01-01 10:00:04|4|
> |1|80|2018-01-01 10:00:05|5|
> |1|700|2018-01-01 11:00:04|1|
> |1|205|2018-01-02 10:00:02|1|
> |1|500|2018-01-02 10:00:03|2|
> |3|80|2018-01-02 10:00:05|1|
> As you can see, in the third row the total_count should be 3 instead of 2, 
> because there are 2 previous records and not 1. The error carries over to the 
> following rows.
>  This happens with the other aggregation operations as well.
> Even if the first rows share the same timestamp, that does not mean they do not 
> both exist; it should not be treated as if the only row that exists is the last 
> one with that timestamp.
>  
> Could you help me?
> Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps

2018-11-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Gómez updated SPARK-25996:
--
Description: 
Hi, how's it going?

When using pyspark, I perform an operation that counts the records prior to the 
current row's date, including the current row in the count, with the 
corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as total_count
 from df3
 """
 df3 = sqlContext.sql(query)

and it returns the following:

+---------+------+-------------------+-----------+
|ACCOUNTID|AMOUNT|                 TS|total_count|
+---------+------+-------------------+-----------+
|        1|   100|2018-01-01 00:00:01|          1|
|        1|  1000|2018-01-01 10:00:01|          1|
|        1|    25|2018-01-01 10:00:02|          2|
|        1|   500|2018-01-01 10:00:03|          3|
|        1|   100|2018-01-01 10:00:04|          4|
|        1|    80|2018-01-01 10:00:05|          5|
|        1|   700|2018-01-01 11:00:04|          1|
|        1|   205|2018-01-02 10:00:02|          1|
|        1|   500|2018-01-02 10:00:03|          2|
|        3|    80|2018-01-02 10:00:05|          1|

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over into the 
following rows.
 This happens with the other aggregation operations as well.

Even though the date of the first rows is the same, that does not change the 
fact that both of them exist, and it should not be treated as if the only one 
that exists is the last one with that date.

 

Could you help me?

Thank you very much

  was:
Hi, how's it going?

When using pyspark, I perform an operation that counts the records prior to the 
current row's date, including the current row in the count, with the 
corresponding query:

query = """
select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as total_count
from df3
"""
df3 = sqlContext.sql(query)

and it returns the following:

!image-2018-11-09-18-25-55-296.png!

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over into the 
following rows.
This happens with the other aggregation operations as well.

Even though the date of the first rows is the same, that does not change the 
fact that both of them exist, and it should not be treated as if the only one 
that exists is the last one with that date.

 

Could you help me?

Thank you very much


> Aggregations do not return the correct values for rows with equal timestamps
> --
>
> Key: SPARK-25996
> URL: https://issues.apache.org/jira/browse/SPARK-25996
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1, 2.4.0
> Environment: Windows 10
> PyCharm 2018.2.2
> Python 3.6
>  
>Reporter: Ignacio Gómez
>Priority: Major
>
> Hi, how's it going?
> When using pyspark, I perform an operation that counts the records prior to 
> the current row's date, including the current row in the count, with the 
> corresponding query:
> query = """
>  select *, count(*) over (partition by ACCOUNTID
>  order by TS
>  range between interval 5000 milliseconds preceding and current row) as 
> total_count
>  from df3
>  """
>  df3 = sqlContext.sql(query)
> and it returns the following:
> +---------+------+-------------------+-----------+
> |ACCOUNTID|AMOUNT|                 TS|total_count|
> +---------+------+-------------------+-----------+
> |        1|   100|2018-01-01 00:00:01|          1|
> |        1|  1000|2018-01-01 10:00:01|          1|
> |        1|    25|2018-01-01 10:00:02|          2|
> |        1|   500|2018-01-01 10:00:03|          3|
> |        1|   100|2018-01-01 10:00:04|          4|
> |        1|    80|2018-01-01 10:00:05|          5|
> |        1|   700|2018-01-01 11:00:04|          1|
> |        1|   205|2018-01-02 10:00:02|          1|
> |        1|   500|2018-01-02 10:00:03|          2|
> |        3|    80|2018-01-02 10:00:05|          1|
> As you can see, in the third row the total_count should be 3 instead of 2, 
> because there are 2 previous records and not 1. The error carries over into 
> the following rows.
>  This happens with the other aggregation operations as well.
> Even though the date of the first rows is the same, that does not change the 
> fact that both of them exist, and it should not be treated as if the only one 
> that exists is the last one with that date.
>  
> Could you help me?
> Thank you very much



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps

2018-11-09 Thread JIRA
Ignacio Gómez created SPARK-25996:
-

 Summary: Aggregations do not return the correct values for rows 
with equal timestamps
 Key: SPARK-25996
 URL: https://issues.apache.org/jira/browse/SPARK-25996
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0, 2.3.1
 Environment: Windows 10
PyCharm 2018.2.2

Python 3.6

 
Reporter: Ignacio Gómez


Hi, how's it going?

When using pyspark, I perform an operation that counts the records prior to the 
current row's date, including the current row in the count, with the 
corresponding query:

query = """
select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as total_count
from df3
"""
df3 = sqlContext.sql(query)

and it returns the following:

!image-2018-11-09-18-25-55-296.png!

As you can see, in the third row the total_count should be 3 instead of 2, 
because there are 2 previous records and not 1. The error carries over into the 
following rows.
This happens with the other aggregation operations as well.

Even though the date of the first rows is the same, that does not change the 
fact that both of them exist, and it should not be treated as if the only one 
that exists is the last one with that date.

 

Could you help me?

Thank you very much
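
For reference, a minimal self-contained PySpark sketch of the reported setup 
(a local SparkSession and a subset of the rows from the table above are 
assumptions for illustration; the query is the one from the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-window-repro").getOrCreate()

# Subset of the reported data, cast from string to timestamp.
rows = [(1, 100, "2018-01-01 00:00:01"),
        (1, 1000, "2018-01-01 10:00:01"),
        (1, 25, "2018-01-01 10:00:02"),
        (1, 500, "2018-01-01 10:00:03")]
df = (spark.createDataFrame(rows, ["ACCOUNTID", "AMOUNT", "TS"])
      .selectExpr("ACCOUNTID", "AMOUNT", "cast(TS as timestamp) as TS"))
df.createOrReplaceTempView("df3")

query = """
select *, count(*) over (partition by ACCOUNTID
                         order by TS
                         range between interval 5000 milliseconds preceding
                               and current row) as total_count
from df3
"""
spark.sql(query).show(truncate=False)
{code}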



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 8:07 PM:
-

Compared to the previous, the above example is a) much more minimal, b) 
genuinely useful, and c) actually works with save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 7:56 PM:
-

This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
 imp = ImputeNormal.load("impute")
 imp.explainParams()
 impute_model.write().save("impute_model")
 impm = ImputeNormalModel.load("imputer_model")
 impm = ImputeNormalModel.load("impute_model")
 impm.getInputCol()
 impm.getOutputCol()
 impm.getMean()
 impm.getStddev(){code}

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer commented on SPARK-21542:


This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:

impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("imputer_model")
impm = ImputeNormalModel.load("impute_model")
impm.getInputCol()
impm.getOutputCol()
impm.getMean()
impm.getStddev()

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 7:54 PM:
-

This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
 imp = ImputeNormal.load("impute")
 imp.explainParams()
 impute_model.write().save("impute_model")
 impm = ImputeNormalModel.load("imputer_model")
 impm = ImputeNormalModel.load("impute_model")
 impm.getInputCol()
 impm.getOutputCol()
 impm.getMean()
 impm.getStddev(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:

impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("imputer_model")
impm = ImputeNormalModel.load("impute_model")
impm.getInputCol()
impm.getOutputCol()
impm.getMean()
impm.getStddev()

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681892#comment-16681892
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

Yep, the pyspark job completes fine after we removed the ipv6 references in 
/etc/hosts.

Thank you both 
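
For anyone hitting this, a quick sketch (illustration only) of how to check what 
"localhost" resolves to on the driver host, using the same getaddrinfo call as 
the pyspark/rdd.py code quoted below:

{code:python}
import socket

# Print every address family "localhost" resolves to; an AF_INET6 entry on a
# host whose IPv6 stack is disabled matches the failure mode described below.
for af, socktype, proto, canonname, sa in socket.getaddrinfo(
        "localhost", None, socket.AF_UNSPEC, socket.SOCK_STREAM):
    print(af, sa)
{code}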

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprit is in lib/spark2/python/pyspark/rdd.py in this line 
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by socket.AF_UNSPEC third option to the 
> socket.getaddrinfo() call.
> I tried to call similar socket.getaddrinfo call locally outside 

[jira] [Resolved] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-25958.
---
Resolution: Not A Problem

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprit is in lib/spark2/python/pyspark/rdd.py in this line 
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by socket.AF_UNSPEC third option to the 
> socket.getaddrinfo() call.
> I tried a similar socket.getaddrinfo call locally outside of PySpark 
> and it worked fine.
> RHEL 7.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681891#comment-16681891
 ] 

John Bauer commented on SPARK-21542:


{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, randn

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
#from pyspark.ml.feature import SQLTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

spark = SparkSession\
    .builder\
    .appName("ImputeNormal")\
    .getOrCreate()


class ImputeNormal(Estimator,
                   HasInputCol,
                   HasOutputCol,
                   DefaultParamsReadable,
                   DefaultParamsWritable,
                   ):
    @keyword_only
    def __init__(self, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormal, self).__init__()

        self._setDefault(inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _fit(self, data):
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()

        # Estimate mean and stddev of the input column from the training data.
        stats = data.select(inputCol).describe()
        mean = stats.where(col("summary") == "mean").take(1)[0][inputCol]
        stddev = stats.where(col("summary") == "stddev").take(1)[0][inputCol]

        return ImputeNormalModel(mean=float(mean),
                                 stddev=float(stddev),
                                 inputCol=inputCol,
                                 outputCol=outputCol,
                                 )
        # FOR A TRULY MINIMAL BUT LESS DIDACTICALLY EFFECTIVE DEMO, DO INSTEAD:
        # sql_text = "SELECT *, IF({inputCol} IS NULL, {stddev} * randn() + {mean}, {inputCol}) AS {outputCol} FROM __THIS__"
        # return SQLTransformer(statement=sql_text.format(stddev=stddev, mean=mean, inputCol=inputCol, outputCol=outputCol))


class ImputeNormalModel(Model,
                        HasInputCol,
                        HasOutputCol,
                        DefaultParamsReadable,
                        DefaultParamsWritable,
                        ):

    mean = Param(Params._dummy(), "mean",
                 "Mean value of imputations. Calculated by fit method.",
                 typeConverter=TypeConverters.toFloat)

    stddev = Param(Params._dummy(), "stddev",
                   "Standard deviation of imputations. Calculated by fit method.",
                   typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, mean=0.0, stddev=1.0, inputCol="inputCol",
                 outputCol="outputCol"):
        super(ImputeNormalModel, self).__init__()

        self._setDefault(mean=0.0, stddev=1.0, inputCol="inputCol",
                         outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol",
                  outputCol="outputCol"):
        """
        setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def getMean(self):
        return self.getOrDefault(self.mean)

    def setMean(self, mean):
        self._set(mean=mean)

    def getStddev(self):
        return self.getOrDefault(self.stddev)

    def setStddev(self, stddev):
        self._set(stddev=stddev)

    def _transform(self, data):
        mean = self.getMean()
        stddev = self.getStddev()
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()

        # Replace nulls with draws from the fitted normal distribution.
        df = data.withColumn(outputCol,
                             when(col(inputCol).isNull(),
                                  stddev * randn() + mean)
                             .otherwise(col(inputCol)))
        return df


if __name__ == "__main__":

    train = spark.createDataFrame([[0], [1], [2]] + [[None]] * 100, ['input'])
    impute = ImputeNormal(inputCol='input', outputCol='output')
    impute_model = impute.fit(train)
    print("Input column: {}".format(impute_model.getInputCol()))
    print("Output column: {}".format(impute_model.getOutputCol()))
    print("Mean: {}".format(impute_model.getMean()))
    print("Standard Deviation: {}".format(impute_model.getStddev()))
    test = impute_model.transform(train)
    test.show(10)
    test.describe().show()
    print("mean and stddev for outputCol should be close to those of inputCol")
{code}
 

> Helper functions for custom Python Persistence
> 

[jira] [Created] (SPARK-25995) sparkR should ensure user args are after the argument used for the port

2018-11-09 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-25995:
-

 Summary: sparkR should ensure user args are after the argument 
used for the port
 Key: SPARK-25995
 URL: https://issues.apache.org/jira/browse/SPARK-25995
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.2
Reporter: Thomas Graves


Currently if you run sparkR and accidentally specify an argument, it fails with 
a useless error message.  For example:

$SPARK_HOME/bin/sparkR  --master yarn --deploy-mode client fooarg

This gets turned into:

Launching java with spark-submit command spark-submit   "--master" "yarn" 
"--deploy-mode" "client" "sparkr-shell" "fooarg" 
/tmp/Rtmp6XBGz2/backend_port162806ea36bca

Notice that "fooarg" got put before the /tmp file, which is how R and the JVM 
know which port to connect to. SparkR eventually fails with a timeout exception 
after 10 seconds.

 

SparkR should either not allow args or make sure the order is correct so that 
the backend_port file is always first; see 
https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L129



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25994) SPIP: DataFrame-based graph queries and algorithms

2018-11-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25994:
-

 Summary: SPIP: DataFrame-based graph queries and algorithms
 Key: SPARK-25994
 URL: https://issues.apache.org/jira/browse/SPARK-25994
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Affects Versions: 3.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


[placeholder]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file

2018-11-09 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681786#comment-16681786
 ] 

Maxim Gekk commented on SPARK-24244:


> is this new option available in PySpark too?

Yes, it is, as well as in R and Scala/Java.
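
As an illustration (the path and column names below are made up), the pruning 
applies whenever only a subset of columns is selected from a CSV read:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-pruning").getOrCreate()

# Only the selected columns need to be parsed; the physical plan's FileScan
# node should show the pruned ReadSchema.
df = spark.read.option("header", "true").csv("/tmp/example.csv")
df.select("col1", "col4").explain()
{code}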

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or 
> indexes for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25993:
---

 Summary: Add test cases for resolution of ORC table location
 Key: SPARK-25993
 URL: https://issues.apache.org/jira/browse/SPARK-25993
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.3.2
Reporter: Xiao Li


Add a test case based on the following example. The behavior was changed in the 
2.3 release. We also need to update the migration guide.

{code:java}
val someDF1 = Seq(
  (1, 1, "blah"),
  (1, 2, "blahblah")
).toDF("folder", "number", "word").repartition(1)

someDF1.write.orc("/tmp/orctab1/dir1/")
someDF1.write.orc("/mnt/orctab1/dir2/")

create external table tab1(folder int,number int,word string) STORED AS ORC 
LOCATION '/tmp/orctab1/';
select * from tab1;

create external table tab2(folder int,number int,word string) STORED AS ORC 
LOCATION '/tmp/orctab1/*';
select * from tab2;
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25993:

Labels: starter  (was: )

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25979) Window function: allow parentheses around window reference

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25979.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0
   2.4.1

> Window function: allow parentheses around window reference
> --
>
> Key: SPARK-25979
> URL: https://issues.apache.org/jira/browse/SPARK-25979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> Very minor parser bug, but possibly problematic for code-generated queries:
> Consider the following two queries:
> {code}
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> {code}
> and
> {code}
> SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 
> 1
> {code}
> The former, with parens around the OVER condition, fails to parse while the 
> latter, without parens, succeeds:
> {code}
> Error in SQL statement: ParseException: 
> mismatched input '(' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)
> == SQL ==
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> ---^^^
> {code}
> This was found when running the cockroach DB tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25988) Keep names unchanged when deduplicating the column names in Analyzer

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25988.
-
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> Keep names unchanged when deduplicating the column names in Analyzer
> 
>
> Key: SPARK-25988
> URL: https://issues.apache.org/jira/browse/SPARK-25988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> withTempView("tmpView1", "tmpView2") {
>   withTable("tab1", "tab2") {
> sql(
>   """
> |CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE)
> |USING CSV
> |PARTITIONED BY (TDATE)
>   """.stripMargin)
> spark.table("tab1").where("TDATE >= 
> '2017-08-15'").createOrReplaceTempView("tmpView1")
> sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
> sql(
>   """
> |CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS
> |SELECT N.tdate, col1 AS aliasCol1
> |FROM tmpView1 N
> |JOIN tab2 Z
> |ON N.tdate = Z.tdate
>   """.stripMargin)
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
>   sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = 
> y.tdate").collect()
> }
>   }
> }
> {code}
> The above code will issue the following error.
> {code}
> Expected only partition pruning predicates: 
> ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) >= 
> 2017-08-15));
> org.apache.spark.sql.AnalysisException: Expected only partition pruning 
> predicates: ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) 
> >= 2017-08-15));
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146)
>   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:958)
>   at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file

2018-11-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681735#comment-16681735
 ] 

Ruslan Dautkhanov commented on SPARK-24244:
---

[~maxgekk] great improvement!

Is this new option available in PySpark too?

 

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or 
> indexes for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24421:


Assignee: (was: Apache Spark)

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: release-notes
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681722#comment-16681722
 ] 

Apache Spark commented on SPARK-24421:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22993

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: release-notes
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24421:
--
Labels: release-notes  (was: )

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: release-notes
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24421:


Assignee: Apache Spark

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>  Labels: release-notes
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2018-11-09 Thread Julia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681712#comment-16681712
 ] 

Julia commented on SPARK-23814:
---

[~hyukjin.kwon] Still got the same error with Spark 2.3.1.
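
A minimal sketch of the reproduction described below (the S3 prefix is the 
placeholder from the report, and the file name with a colon is assumed to live 
under it):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("colon-multiline").getOrCreate()

# Prefix containing a file such as 2017-08-01T00:00:00Z.csv.gz (placeholder path).
path = "s3n://DirectoryPath/"

spark.read.csv(path)                              # works, even with ':' in the file name
spark.read.option("multiLine", "true").csv(path)  # reported to fail with URISyntaxException
{code}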

> Couldn't read file with colon in name and new line character in one of the 
> field.
> -
>
> Key: SPARK-23814
> URL: https://issues.apache.org/jira/browse/SPARK-23814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.2.0
>Reporter: bharath kumar avusherla
>Priority: Major
>
> When the file name has a colon in it and the data contains a new line 
> character, reading with 
> spark.read.option("multiLine","true").csv("s3n://DirectoryPath/") throws a 
> *"java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 2017-08-01T00:00:00Z.csv.gz"* error. If we remove 
> option("multiLine","true"), it works just fine even though the file name has 
> a colon in it. It also works fine if I apply *option("multiLine","true")* to 
> any other file whose name doesn't have a colon in it. But when both are 
> present (a colon in the file name and a new line in the data), it does not 
> work.
> {quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.(Path.java:171)
>   at org.apache.hadoop.fs.Path.(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:253)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
>   at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
>   at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>   at 
> org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
>   at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
>   at 
> org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> 2017-08-01T00:00:00Z.csv.gz
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 86 more
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Commented] (SPARK-25696) The storage memory displayed on spark Application UI is incorrect.

2018-11-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681500#comment-16681500
 ] 

Sean Owen commented on SPARK-25696:
---

Per the pull request -- the error is actually slightly different. Yes, 1024 
should be the factor, but all the units also need to be displayed as kibibytes, 
etc. (KiB, GiB and so on). Just changing the 1000 is wrong.
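
For illustration only, a Python sketch of the intended conversion (base 1024 
with binary unit labels); the actual fix belongs in utils.js:

{code:python}
def format_bytes(num_bytes):
    # Binary units: 1 KiB = 1024 B, and the labels say so.
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
    k = 1024.0
    i = 0
    while num_bytes >= k and i < len(units) - 1:
        num_bytes /= k
        i += 1
    return "{:.1f} {}".format(num_bytes, units[i])

print(format_bytes(1536))     # 1.5 KiB
print(format_bytes(1048576))  # 1.0 MiB
{code}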

> The storage memory displayed on spark Application UI is incorrect.
> --
>
> Key: SPARK-25696
> URL: https://issues.apache.org/jira/browse/SPARK-25696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Major
>
> In the reported heartbeat information, the memory data is in bytes, which is 
> converted by the formatBytes() function in the utils.js file before being 
> displayed in the interface. The base used for the unit conversion in the 
> formatBytes function is 1000, but it should be 1024.
> function formatBytes(bytes, type) {
>   if (type !== 'display') return bytes;
>   if (bytes == 0) return '0.0 B';
>   var k = 1000;
>   var dm = 1;
>   var sizes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'];
>   var i = Math.floor(Math.log(bytes) / Math.log(k));
>   return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
> }
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25973) Spark History Main page performance improvement

2018-11-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25973.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22982
[https://github.com/apache/spark/pull/22982]

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Assignee: William Montaz
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check whether it must 
> display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  
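
As an aside, a rough Python analogue of the short-circuit idea (the data and 
field names are made up for illustration):

{code:python}
apps = [{"id": "app-1", "completed": True}, {"id": "app-2", "completed": False}]

# Counting forces a full pass over the collection:
show_table = sum(1 for a in apps if a["completed"]) > 0

# An existence check stops at the first match, which is what switching to
# Iterator.exists achieves in HistoryPage.scala:
show_table = any(a["completed"] for a in apps)
{code}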



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement

2018-11-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25973:
-

Assignee: William Montaz

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Assignee: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check whether it must 
> display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25992) Accumulators giving KeyError in pyspark

2018-11-09 Thread Abdeali Kothari (JIRA)
Abdeali Kothari created SPARK-25992:
---

 Summary: Accumulators giving KeyError in pyspark
 Key: SPARK-25992
 URL: https://issues.apache.org/jira/browse/SPARK-25992
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Abdeali Kothari


I am using accumulators, and when I run my code I sometimes get warning 
messages. When I checked, nothing had been accumulated - I am not sure whether 
I lost information from the accumulator, or whether it worked and I can ignore 
this error?

The message:
{noformat}
Exception happened during processing of request from
('127.0.0.1', 62099)
Traceback (most recent call last):
File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in 
_handle_request_noblock
self.process_request(request, client_address)
File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in 
process_request
self.finish_request(request, client_address)
File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in 
finish_request
self.RequestHandlerClass(request, client_address, self)
File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in 
__init__
self.handle()
File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, 
in handle
_accumulatorRegistry[aid] += update
KeyError: 0

2018-11-09 19:09:08 ERROR DAGScheduler:91 - Failed to update accumulators for 
task 0
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:634)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1131)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1123)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1123)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1206)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-25991.
-
Resolution: Invalid

> Update binary for 2.4.0 release
> ---
>
> Key: SPARK-25991
> URL: https://issues.apache.org/jira/browse/SPARK-25991
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Vladimir Tsvetkov
>Priority: Major
> Attachments: image-2018-11-09-20-12-47-245.png
>
>
> The archive for the 2.4.0 release contains old binaries: 
> https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25991:

Attachment: image-2018-11-09-20-12-47-245.png

> Update binary for 2.4.0 release
> ---
>
> Key: SPARK-25991
> URL: https://issues.apache.org/jira/browse/SPARK-25991
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Vladimir Tsvetkov
>Priority: Major
> Attachments: image-2018-11-09-20-12-47-245.png
>
>
> The archive for the 2.4.0 release contains old binaries: 
> https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681371#comment-16681371
 ] 

Yuming Wang commented on SPARK-25991:
-

Please check your SPARK_HOME:
!image-2018-11-09-20-12-47-245.png!

> Update binary for 2.4.0 release
> ---
>
> Key: SPARK-25991
> URL: https://issues.apache.org/jira/browse/SPARK-25991
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Vladimir Tsvetkov
>Priority: Major
> Attachments: image-2018-11-09-20-12-47-245.png
>
>
> The archive for the 2.4.0 release contains old binaries: 
> https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Vladimir Tsvetkov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681306#comment-16681306
 ] 

Vladimir Tsvetkov commented on SPARK-25991:
---

[~yumwang] Sounds strange, but I ran spark-submit --version and saw version 2.3. 
Maybe I messed up my paths. Please close this issue. Thanks

> Update binary for 2.4.0 release
> ---
>
> Key: SPARK-25991
> URL: https://issues.apache.org/jira/browse/SPARK-25991
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Vladimir Tsvetkov
>Priority: Major
>
> The archive for the 2.4.0 release contains old binaries: 
> https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681300#comment-16681300
 ] 

Yuming Wang commented on SPARK-25991:
-

Sorry. I do not understand what you mean.

> Update binary for 2.4.0 release
> ---
>
> Key: SPARK-25991
> URL: https://issues.apache.org/jira/browse/SPARK-25991
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Vladimir Tsvetkov
>Priority: Major
>
> The archive for the 2.4.0 release contains old binaries: 
> https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-09 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681297#comment-16681297
 ] 

Steve Loughran commented on SPARK-25966:


bq. Hadoop 3.1.x is not yet officially supported in Spark.

True, but there were some changes to the S3A input stream, so it is worth 
checking whether they caused this.

h3.  better recovery of failures in the underlying read() call

Before: 
[https://github.com/apache/hadoop/blob/branch-2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L382]

After: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L364]

h3. AWS SDK++

An update to a more recent AWS SDK (1.11.271), which complains a lot more if 
you close an input stream while there is still data to read.

h3.  Adaptive seek policy

When you start off with fadvise=normal, the first read is of the full file, but if 
you do a backward seek it switches to random IO 
(fs.s3a.experimental.fadvise=random): 
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L281]

Unless fadvise=random is set (best) or fadvise=sequential (which is completely wrong 
for striped columnar formats), the Parquet reader follows that codepath.

[~andrioni]:  can you put the log {{org.apache.hadoop.fs.s3a.S3AInputStream}} 
into DEBUG and see what it says on these failures?
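
As a rough sketch (not part of the ticket), the seek policy and the requested DEBUG logging could be set from a job roughly like this; the SparkSession setup is illustrative and assumes the log4j 1.x API bundled with Spark 2.x:

{code:scala}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

// Illustrative session; in an existing job, reuse the already-created SparkSession.
val spark = SparkSession.builder().appName("s3a-read-debug").getOrCreate()

// Switch the S3A input stream to random IO, which suits columnar formats like Parquet.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.experimental.fadvise", "random")

// Enable the DEBUG logging asked for above (normally done via log4j.properties).
Logger.getLogger("org.apache.hadoop.fs.s3a.S3AInputStream").setLevel(Level.DEBUG)
{code}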

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one of 
> our Spark jobs using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 

[jira] [Created] (SPARK-25991) Update binary for 2.4.0 release

2018-11-09 Thread Vladimir Tsvetkov (JIRA)
Vladimir Tsvetkov created SPARK-25991:
-

 Summary: Update binary for 2.4.0 release
 Key: SPARK-25991
 URL: https://issues.apache.org/jira/browse/SPARK-25991
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Vladimir Tsvetkov


The archive for the 2.4.0 release contains old binaries: 
https://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-09 Thread Alan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681151#comment-16681151
 ] 

Alan commented on SPARK-24421:
--

If I understand correctly, the high-level need is 
-XX:MaxDirectMemorySize=unlimited but without specifying a command line option. 
Do you specify any other arguments? Maybe you could include an arg file with 
all options?

As regards the hack, it looks like it involves the non-public constructor 
needed for JNI NewDirectByteBuffer and then patching the cleaner field. Ugh, 
that is way too fragile, as the JDK internals can change at any time; also, 
hacking into buffer fields will break once java.base is fully encapsulated.

 

 

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22737) Simplity OneVsRest transform

2018-11-09 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-22737.
--
Resolution: Not A Problem

> Simplity OneVsRest transform
> 
>
> Key: SPARK-22737
> URL: https://issues.apache.org/jira/browse/SPARK-22737
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Major
>
> The current implementation of OneVsRest#transform is over-complicated. It sequentially 
> updates an accumulated column.
> By using a direct prediction UDF, we obtain a speedup of at least 2x.
> In an extreme case with 20 classes, it obtains about a 14x speedup.
> The test code and performance comparison details are in the corresponding PR.
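
A hedged sketch of the "direct UDF" idea (not the PR's code): the per-class raw scores below are placeholder data standing in for the outputs of the k binary models, and the UDF picks the winning class with a single argmax instead of updating an accumulated column once per class.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("ovr-udf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder per-class raw scores; in OneVsRest these come from the binary models.
val scored = Seq(
  Array(0.2, 1.5, -0.3),
  Array(-1.0, 0.4, 0.9)
).toDF("rawScores")

// One pass over the scores per row, instead of one column update per class.
val predictUdf = udf { (scores: Seq[Double]) => scores.indices.maxBy(scores).toDouble }

scored.withColumn("prediction", predictUdf($"rawScores")).show()
{code}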



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2018-11-09 Thread Deej (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681103#comment-16681103
 ] 

Deej edited comment on SPARK-12216 at 11/9/18 9:08 AM:
---

This issue has *NOT* been fixed, so marking it as Resolved is plain silly. 
Moreover, suggesting that users switch to other OSes is not only reckless but 
also regressive when there is a large community of users attempting to adopt 
Spark as one of their large-scale data processing tools. So please stop with 
the condescension and work on fixing this bug, as the community has been 
expecting for a long while now.

 

As others have reported, I am able to launch spark-shell and perform basic 
tasks (including sc.stop()) successfully. However, the moment I try to quit the 
REPL session, it craps out immediately. Also, I am able to manually delete the 
temp files/folders Spark creates in the temp directory, so there are no 
permissions issues. Even executing these commands from a command prompt running 
as Administrator results in the same error, reinforcing the assumption that 
this is not related to permissions on the temp folder at all.

Here is my set-up to reproduce this issue:-

OS: Windows 10

Spark: version 2.3.2

 /_/
 Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
  
 Stack trace:
 ===
 scala> sc
 res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded
 scala> sc.stop()
 scala> :quit
 2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting 
Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
 java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
     at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074)
     at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
     at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
     at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
     at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
     at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
     at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
 


was (Author: laal):
This issue has *NOT* been fixed, so marking it as Resolved is plain silly. 
Moreover, suggesting that users switch to other OSes is not only reckless but 
also regressive when there is a large community of users attempting to adopt 
Spark as one of their large-scale data processing tools. So please stop with 
the condescension and work on fixing this bug, as the community has been 
expecting for a long while now.

 

As others have reported, I am able to launch spark-shell and perform basic 
tasks (including sc.stop()) successfully. However, the moment I try to quit the 
REPL session, it craps out immediately.

Here is my set-up to reproduce this issue:-

OS: Windows 10

Spark: version 2.3.2

 /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
 
Stack trace:
===
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded
scala> sc.stop()
scala> :quit
2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting 
Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
    at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
    at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at 

[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2018-11-09 Thread Deej (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681103#comment-16681103
 ] 

Deej commented on SPARK-12216:
--

This issue has *NOT* been fixed, so marking it as Resolved is plain silly. 
Moreover, suggesting that users switch to other OSes is not only reckless but 
also regressive when there is a large community of users attempting to adopt 
Spark as one of their large-scale data processing tools. So please stop with 
the condescension and work on fixing this bug, as the community has been 
expecting for a long while now.

 

As others have reported, I am able to launch spark-shell and perform basic 
tasks (including sc.stop()) successfully. However, the moment I try to quit the 
REPL session, it craps out immediately.

Here is my set-up to reproduce this issue:-

OS: Windows 10

Spark: version 2.3.2

 /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
 
Stack trace:
===
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@41167ded
scala> sc.stop()
scala> :quit
2018-11-09 00:10:42 ERROR ShutdownHookManager:91 - Exception while deleting 
Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-b155db59-b7c5-4f64-8cfb-00d8f95ea348\repl-fed61a6e-3a1e-46cf-90e9-3fbfcb8a1d87
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1074)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
    at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
    at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)


> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> 

[jira] [Assigned] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24229:


Assignee: (was: Apache Spark)

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/],
>  
> there are critical vulnerabilities in libthrift 0.9.3, which is currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help assess the seriousness of this and what should be done 
> about it?
>  
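
For a downstream build that cannot wait for the Spark-side upgrade, one hedged option is to pin the transitive dependency yourself; the sbt snippet below is only an illustration under the assumption that your project is built with sbt, not a statement of how Spark itself will fix this.

{code:scala}
// build.sbt (sketch): force libthrift 0.10.0 instead of the transitively pulled 0.9.3.
dependencyOverrides += "org.apache.thrift" % "libthrift" % "0.10.0"
{code}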



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681099#comment-16681099
 ] 

Apache Spark commented on SPARK-24229:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22992

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/],
>  
> there are critical vulnerabilities in libthrift 0.9.3, which is currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24229:


Assignee: Apache Spark

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Assignee: Apache Spark
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/],
>  
> there are critical vulnerabilities in libthrift 0.9.3, which is currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25973:


Assignee: (was: Apache Spark)

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it 
> is displaying incomplete or complete applications) to check whether it must 
> display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25973) Spark History Main page performance improvement

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25973:


Assignee: Apache Spark

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Assignee: Apache Spark
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it 
> is displaying incomplete or complete applications) to check whether it must 
> display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681054#comment-16681054
 ] 

Apache Spark commented on SPARK-25973:
--

User 'Willymontaz' has created a pull request for this issue:
https://github.com/apache/spark/pull/22982

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it 
> is displaying incomplete or complete applications) to check whether it must 
> display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25989:


Assignee: (was: Apache Spark)

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its name 
> is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}
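
A minimal sketch of the guard the report implies (an illustrative helper, not the actual OneVsRestModel patch): only append a column when its configured name is non-empty, mirroring how ClassificationModel skips empty output columns.

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}

// Append `col` under `name` only when the output column name is non-empty.
def withOptionalColumn(df: DataFrame, name: String, col: Column): DataFrame =
  if (name.nonEmpty) df.withColumn(name, col) else df
{code}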



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25852) we should filter the workOffers with freeCores>=CPUS_PER_TASK at first for better performance

2018-11-09 Thread zuotingbing (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-25852:

Priority: Trivial  (was: Major)

> we should filter the workOffers with freeCores>=CPUS_PER_TASK at first for 
> better performance
> -
>
> Key: SPARK-25852
> URL: https://issues.apache.org/jira/browse/SPARK-25852
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.2
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2018-10-26_162822.png
>
>
> We should filter the workOffers with freeCores>=CPUS_PER_TASK for better 
> performance.
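
A small sketch of the proposed pre-filtering (the WorkerOffer case class and CPUS_PER_TASK value below are illustrative stand-ins for the scheduler's types and spark.task.cpus):

{code:scala}
// Illustrative stand-in for the scheduler's WorkerOffer.
case class WorkerOffer(executorId: String, host: String, freeCores: Int)

val CPUS_PER_TASK = 1  // normally derived from spark.task.cpus

val offers = Seq(
  WorkerOffer("exec-1", "host-a", freeCores = 4),
  WorkerOffer("exec-2", "host-b", freeCores = 0)  // can never run a task this round
)

// Drop offers that cannot run even one task before doing per-offer scheduling work.
val usableOffers = offers.filter(_.freeCores >= CPUS_PER_TASK)
{code}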



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25990) TRANSFORM should handle different data types correctly

2018-11-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25990:
---

 Summary: TRANSFORM should handle different data types correctly
 Key: SPARK-25990
 URL: https://issues.apache.org/jira/browse/SPARK-25990
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681020#comment-16681020
 ] 

Apache Spark commented on SPARK-25989:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22991

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its name 
> is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681018#comment-16681018
 ] 

Apache Spark commented on SPARK-25989:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22991

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its name 
> is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25989:


Assignee: Apache Spark

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its name 
> is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-25989:
-
Priority: Minor  (was: Major)

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its name 
> is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-25989:


 Summary: OneVsRestModel handle empty outputCols incorrectly
 Key: SPARK-25989
 URL: https://issues.apache.org/jira/browse/SPARK-25989
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


{{ml.classification.ClassificationModel}} will ignore empty output columns.

However, {{OneVsRestModel}} still tries to append a new column even if its name is 
an empty string.
{code:java}

scala> ovrModel.setPredictionCol("").transform(test).show
+-+++---+
|label| features| rawPrediction| |
+-+++---+
| 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
| 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
| 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
| 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
| 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
| 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
| 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
| 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
| 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
| 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
| 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
| 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
+-+++---+
only showing top 20 rows


scala> 
ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
+-+++---+
|label| features| raw| |
+-+++---+
| 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
| 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
| 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
| 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
| 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
| 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
| 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
| 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
| 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
| 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
| 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
| 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
| 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
| 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
+-+++---+
only showing top 20 rows
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org