[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK9+

2018-10-17 Thread Adrian Cole (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654705#comment-16654705
 ] 

Adrian Cole commented on SPARK-24417:
-

Yes, please skip to 11: in Zipkin we noticed some things break on 11 but not on 
9, and 11 is a long-term support release (the first patch release is also out). I 
suspect you'll get a surge of demand. For example, Spark is our only code which 
can't run on JRE 11 at the moment.

> Build and Run Spark on JDK9+
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support Java 9+
> As Java 8 is going away soon, Java 11 will be LTS and GA this September, and many 
> companies are testing Java 9 or Java 10 to prepare for Java 11, so it's best to 
> start the traumatic process of supporting newer versions of Java in Apache 
> Spark as a background activity. 
> The subtasks are what have to be done to support Java 9+.
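
For illustration, a minimal sketch (not Spark's actual implementation) of how JVM-side code can detect the Java major version and branch between a Java 8 path and a Java 9+ path:

{code:scala}
// Illustrative only: Java 8 reports a specification version like "1.8",
// while Java 9+ report "9", "10", "11", ...
object JavaVersionCheck {
  def majorJavaVersion: Int = {
    val v = System.getProperty("java.specification.version")
    if (v.startsWith("1.")) v.stripPrefix("1.").toInt else v.toInt
  }

  def main(args: Array[String]): Unit = {
    if (majorJavaVersion >= 9) {
      println(s"Running on Java $majorJavaVersion: use 9+ compatible code paths")
    } else {
      println("Running on Java 8: keep the legacy code path")
    }
  }
}
{code}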



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25740) Refactor DetermineTableStats to invalidate cache when some configuration changed

2018-10-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25740:

Summary: Refactor DetermineTableStats to invalidate cache when some 
configuration changed  (was: Set some configuration need invalidateStatsCache)

> Refactor DetermineTableStats to invalidate cache when some configuration 
> changed
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> exit;
> spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call invalidateAllCachedTables when executing the SET command:
> {code:java}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}
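
For illustration, a sketch of an alternative to invalidating everything: cache the stats together with the configuration values they were computed under and recompute an entry only when those values change. None of these names exist in Spark; this only sketches the pattern.

{code:scala}
import scala.collection.mutable

// Hypothetical helper: a cache whose entries remember the configuration
// snapshot they were computed under.
class ConfSensitiveStatsCache[K, V](currentConf: () => Map[String, String]) {
  private case class Entry(confSnapshot: Map[String, String], value: V)
  private val cache = mutable.Map.empty[K, Entry]

  def getOrCompute(key: K)(compute: => V): V = {
    val conf = currentConf()
    cache.get(key) match {
      case Some(Entry(snapshot, value)) if snapshot == conf => value
      case _ =>
        val value = compute               // recompute under the new configuration
        cache(key) = Entry(conf, value)
        value
    }
  }
}
{code}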



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25762) Upgrade guava version in spark dependency lists due to CVE issue

2018-10-17 Thread Debojyoti (JIRA)
Debojyoti created SPARK-25762:
-

 Summary: Upgrade guava version in spark dependency lists due to  
CVE issue
 Key: SPARK-25762
 URL: https://issues.apache.org/jira/browse/SPARK-25762
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 2.3.2, 2.3.1, 2.2.2, 2.2.1
Reporter: Debojyoti


In the Spark 2.x dependency list we have guava-14.0.1.jar. However, there are a lot of 
vulnerabilities in this version, e.g. CVE-2018-10237.

[https://www.cvedetails.com/cve/CVE-2018-10237/]

Do we have any solution to resolve it, or is there any plan to upgrade the guava 
version in any of Spark's future releases?
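
One possible consumer-side mitigation, assuming an sbt build (illustrative and untested; forcing a newer Guava may break code that relies on Guava 14 APIs, so an upgrade or shading on the Spark side is the real fix):

{code:scala}
// build.sbt sketch: override the transitive Guava version with a release that
// contains the CVE-2018-10237 fix (fixed in Guava 24.1.1).
dependencyOverrides += "com.google.guava" % "guava" % "24.1.1-jre"
{code}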



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25761) When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.

2018-10-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654635#comment-16654635
 ] 

Yuming Wang commented on SPARK-25761:
-

Could you translate this to English and provide more information, such as logs or UI screenshots?

> When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.
> ---
>
> Key: SPARK-25761
> URL: https://issues.apache.org/jira/browse/SPARK-25761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: hanrentong
>Priority: Major
>
> I executed a SQL statement through Spark SQL. The statement has already finished, but the Spark UI still shows it as running, and the job cannot be killed either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-10-17 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654627#comment-16654627
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 10/18/18 4:36 AM:


Just ran another performance test to check my new attempt at improving state handling.

Here I try to overwrite the values for a given key instead of removing all values and 
appending new values for that key.
https://github.com/HeartSaVioR/spark/commit/6d466b9f424ae6a2b5a927e650f60ef35cfe30ca

The result was no luck (a small performance hit compared to the current approach), so I 
won't include those numbers here. But I ran the test on AWS 
c5d.xlarge with the dedicated option, hence a more isolated and stable environment compared 
to before, which shows a higher input rate.
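
For illustration only (the real change is the commit linked above), the difference between the two per-key update strategies, with a plain map standing in for the state store:

{code:scala}
import scala.collection.mutable

object SessionStateUpdateSketch {
  private val state = mutable.Map.empty[String, Seq[Int]]

  // current approach: remove all values for the key, then append the new ones
  def removeThenAppend(key: String, newValues: Seq[Int]): Unit = {
    state.remove(key)
    state(key) = state.getOrElse(key, Seq.empty[Int]) ++ newValues
  }

  // approach tested here: overwrite the key's values in a single put
  def overwrite(key: String, newValues: Seq[Int]): Unit =
    state(key) = newValues
}
{code}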

Test Env.: c5d.xlarge, dedicated

A. plenty of sessions

1. HWX (Append Mode) 

1.a. input rate 2

||batch id||input rows||input rows per second||processed rows per second||
| 21 | 113355 | 20234.7375937 | 19278.0612245 |
| 22 | 118905 | 20218.5002551 | 17958.7675578 |
| 23 | 12 | 18121.4134703 | 15622.9657597 |
| 24 | 16 | 20827.9093986 | 14406.6270484 |
| 25 | 22 | 19807.3287116 | 12593.0165999 |

2. Baidu (Append Mode)

2.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 18 | 1005000 | 15068.3699172 | 5993.05878565 |
| 19 | 2505000 | 14937.8335669 | 4823.00254531 |

(cancelled since the following batch takes too long... it can't even reach 1)

3. HWX (Update Mode)

3.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 25 | 165000 | 15136.2260343 | 15351.6933383 |
| 26 | 165000 | 15350.2651409 | 28128.196386 |
| 27 | 9 | 15342.6525742 | 16669.7536581 |
| 28 | 75000 | 13888.889 | 13557.483731 |
| 29 | 9 | 16266.0401229 | 15131.1365165 |
| 30 | 9 | 15128.5930408 | 13829.1333743 |

3.b. input rate 2

||batch id||input rows||input rows per second||processed rows per second||
| 23 | 318210 | 19853.3815822 | 20039.6750425 |
| 24 | 32 | 20151.1335013 | 23456.9711186 |
| 25 | 28 | 20523.3453053 | 15197.5683891 |

(cancelled since the following batch takes too long...)

B. plenty of rows in session

1. HWX (Append Mode)

1.a. input rate 3

||batch id||input rows||input rows per second||processed rows per second||
| 21 | 295730 | 30210.4402901 | 25682.1537125 |
| 22 | 36 | 31260.8544634 | 25906.7357513 |
| 23 | 42 | 30222.3501475 | 28753.337441 |
| 24 | 42 | 28751.3691128 | 29702.970297 |
| 25 | 42 | 29700.8698112 | 28561.7137028 |

1.b. input rate 35000

||batch id||input rows||input rows per second||processed rows per second||
| 19 | 441716 | 36073.1727236 | 29971.2308319 |
| 20 | 49 | 33245.1319628 | 28194.9479257 |
| 21 | 63 | 36250.647333 | 30189.7642323 |
| 22 | 735000 | 35219.703867 | 28420.0757869 |
| 23 | 91 | 35185.323 | 30372.8179967 |

2. Baidu (Append Mode)

2.a. input rate 35000

||batch id||input rows||input rows per second||processed rows per second||
| 1 | 4335 | 752.081887578 | 111.233706251 |

(cancelled due to a long-running batch... it can't even keep up with input rate 
1000, as we already know)



was (Author: kabhwan):
Just ran another performance test to check my new attempt at improving state handling.

Here I try to overwrite the values for a given key instead of removing all values and 
appending new values for that key.
https://github.com/HeartSaVioR/spark/commit/6d466b9f424ae6a2b5a927e650f60ef35cfe30ca

The result was no luck (a small performance hit compared to the current approach), so I 
won't include those numbers here. But I ran the test on AWS 
c5d.xlarge with the dedicated option, hence a more isolated and stable environment compared 
to before, which shows a higher input rate.

Test Env.: c5d.xlarge, dedicated

A. plenty of sessions

1. HWX (Append Mode) 

1.a. input rate 2

||batch id||input rows||input rows per second||processed rows per second||
| 21 | 113355 | 20234.7375937 | 19278.0612245 |
| 22 | 118905 | 20218.5002551 | 17958.7675578 |
| 23 | 12 | 18121.4134703 | 15622.9657597 |
| 24 | 16 | 20827.9093986 | 14406.6270484 |
| 25 | 22 | 19807.3287116 | 12593.0165999 |

2. Baidu (Append Mode)

2.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 18 | 1005000 | 15068.3699172 | 5993.05878565 |
| 19 | 2505000 | 14937.8335669 | 4823.00254531 |

(cancelled since the following batch takes too long... it can't even reach 1)

3. HWX (Update Mode)

3.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 25 | 165000 | 15136.2260343 | 15351.6933383 |
| 26 | 165000 | 15350.2651409 | 28128.196386 |
| 27 | 9 | 15342.6525742 | 16669.7536581 |
| 28 | 75000 | 13888.889 | 13557.483731 |
| 29 | 9 | 16266.0401229 | 15131.1365165 |
| 30 | 9 | 15128.5930408 | 13829.1333743 |

4. HWX (Update Mode) 

4.a. input 

[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-10-17 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654627#comment-16654627
 ] 

Jungtaek Lim commented on SPARK-10816:
--

Just ran another performance test to check my new attempt at improving state handling.

Here I try to overwrite the values for a given key instead of removing all values and 
appending new values for that key.
https://github.com/HeartSaVioR/spark/commit/6d466b9f424ae6a2b5a927e650f60ef35cfe30ca

The result was no luck (a small performance hit compared to the current approach), so I 
won't include those numbers here. But I ran the test on AWS 
c5d.xlarge with the dedicated option, hence a more isolated and stable environment compared 
to before, which shows a higher input rate.

Test Env.: c5d.xlarge, dedicated

A. plenty of sessions

1. HWX (Append Mode) 

1.a. input rate 2

||batch id||input rows||input rows per second||processed rows per second||
| 21 | 113355 | 20234.7375937 | 19278.0612245 |
| 22 | 118905 | 20218.5002551 | 17958.7675578 |
| 23 | 12 | 18121.4134703 | 15622.9657597 |
| 24 | 16 | 20827.9093986 | 14406.6270484 |
| 25 | 22 | 19807.3287116 | 12593.0165999 |

2. Baidu (Append Mode)

2.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 18 | 1005000 | 15068.3699172 | 5993.05878565 |
| 19 | 2505000 | 14937.8335669 | 4823.00254531 |

(cancelled since the following batch takes too long... it can't even reach 1)

3. HWX (Update Mode)

3.a. input rate 15000

||batch id||input rows||input rows per second||processed rows per second||
| 25 | 165000 | 15136.2260343 | 15351.6933383 |
| 26 | 165000 | 15350.2651409 | 28128.196386 |
| 27 | 9 | 15342.6525742 | 16669.7536581 |
| 28 | 75000 | 13888.889 | 13557.483731 |
| 29 | 9 | 16266.0401229 | 15131.1365165 |
| 30 | 9 | 15128.5930408 | 13829.1333743 |

4. HWX (Update Mode) 

4.a. input rate 2

||batch id||input rows||input rows per second||processed rows per second||
| 23 | 318210 | 19853.3815822 | 20039.6750425 |
| 24 | 32 | 20151.1335013 | 23456.9711186 |
| 25 | 28 | 20523.3453053 | 15197.5683891 |

(cancelled since the following batch takes too long...)

B. plenty of rows in session

1. HWX (Append Mode)

1.a. input rate 3

||batch id||input rows||input rows per second||processed rows per second||
| 21 | 295730 | 30210.4402901 | 25682.1537125 |
| 22 | 36 | 31260.8544634 | 25906.7357513 |
| 23 | 42 | 30222.3501475 | 28753.337441 |
| 24 | 42 | 28751.3691128 | 29702.970297 |
| 25 | 42 | 29700.8698112 | 28561.7137028 |

1.b. input rate 35000

||batch id||input rows||input rows per second||processed rows per second||
| 19 | 441716 | 36073.1727236 | 29971.2308319 |
| 20 | 49 | 33245.1319628 | 28194.9479257 |
| 21 | 63 | 36250.647333 | 30189.7642323 |
| 22 | 735000 | 35219.703867 | 28420.0757869 |
| 23 | 91 | 35185.323 | 30372.8179967 |

2. Baidu (Append Mode)

2.a. input rate 35000

||batch id||input rows||input rows per second||processed rows per second||
| 1 | 4335 | 752.081887578 | 111.233706251 |

(cancelled due to a long-running batch... it can't even keep up with input rate 
1000, as we already know)


> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25003.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21990
[https://github.com/apache/spark/pull/21990]

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Assignee: Russell Spitzer
>Priority: Major
> Fix For: 3.0.0
>
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This creates a new SparkSessionExtensions object and does not pick up new 
> extensions that could have been set in the config, the way the companion 
> object's getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.
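
For context, a minimal sketch of the kind of configurator class that {{spark.sql.extensions}} points at (the class and rule below are hypothetical examples); the bug is that sessions built through the plain constructor never run it:

{code:scala}
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Registered via spark.sql.extensions=com.example.MyExtensions
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectResolutionRule { session =>
      new Rule[LogicalPlan] {
        // no-op rule, just to show where custom logic would plug in
        override def apply(plan: LogicalPlan): LogicalPlan = plan
      }
    }
  }
}
{code}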



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25003:


Assignee: Russell Spitzer

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Assignee: Russell Spitzer
>Priority: Major
> Fix For: 3.0.0
>
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This creates a new SparkSessionExtensions object and does not pick up new 
> extensions that could have been set in the config, the way the companion 
> object's getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2018-10-17 Thread Foster Langbein (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654543#comment-16654543
 ] 

Foster Langbein commented on SPARK-12312:
-

So I just ran into this issue trying to write to SQL Server. I have to agree 
this is an important issue when talking to SQL Server - it is almost never 
acceptable to use simple username/password authentication due to the security 
implications.

Could there at least be a note in the docs that this is not possible? Say, in the 
third paragraph here: 
[https://spark.apache.org/docs/2.3.2/sql-programming-guide.html#jdbc-to-other-databases]
 where it talks about using username/password as connection properties. I've 
spent a very, very long time trying to figure out why this wasn't possible. The 
way the executors behave is rather odd if you try it, so it wasn't obvious (at 
least to me) why it didn't work.

 

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy issues.
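
For reference, a minimal sketch (placeholder URL, table, and credentials; runnable in spark-shell, where `spark` is the provided SparkSession) of the simple username/password JDBC pattern the docs describe, which this issue notes cannot be swapped for Kerberos authentication on remote executors:

{code:scala}
import java.util.Properties

// Plain username/password authentication as documented for JDBC data sources.
// A Kerberos ticket obtained on the driver is not available to remote executors,
// which is the gap described in this issue.
val props = new Properties()
props.setProperty("user", "dbuser")          // placeholder
props.setProperty("password", "dbpassword")  // placeholder

val df = spark.read.jdbc(
  "jdbc:sqlserver://dbhost:1433;databaseName=mydb",  // placeholder URL
  "dbo.my_table",                                    // placeholder table
  props)
{code}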



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-17 Thread Bihui Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bihui Jin reopened SPARK-25733:
---

This issue isn't a duplicate of SPARK-23961.

In SPARK-23961, toLocalIterator() works but throws an exception if we do not 
consume all records before the Spark context is stopped. In this issue, 
toLocalIterator() does not work at all and we can't get any records from the iterator.

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf as below:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> There are about 200 partitions stored in S3, and the maximum partition size 
> is 28.37 MB.
>  
> After the data was loaded, I executed "df.take(1)" to test the dataframe, and 
> the expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> of the dataframe for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another exception 

[jira] [Commented] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-17 Thread Bihui Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654535#comment-16654535
 ] 

Bihui Jin commented on SPARK-25733:
---

Hi [~bryanc],

This issue isn't a duplicate of SPARK-23961.

In SPARK-23961, toLocalIterator() works but throws an exception if we do not 
consume all records before the Spark context is stopped. In this issue, 
toLocalIterator() does not work at all and we can't get any records from the 
iterator. Thanks.

 

Best Regards

Bihui Jin

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf as below:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> There are about 200 partitions stored in S3, and the maximum partition size 
> is 28.37 MB.
>  
> After the data was loaded, I executed "df.take(1)" to test the dataframe, and 
> the expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> of the dataframe for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError

[jira] [Created] (SPARK-25761) When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.

2018-10-17 Thread hanrentong (JIRA)
hanrentong created SPARK-25761:
--

 Summary: 
When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.
 Key: SPARK-25761
 URL: https://issues.apache.org/jira/browse/SPARK-25761
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: hanrentong


I executed a SQL statement through Spark SQL. The statement has already finished, but the Spark UI still shows it as running, and the job cannot be killed either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654493#comment-16654493
 ] 

Wenchen Fan commented on SPARK-25588:
-

Sounds like the problem is caused by the Parquet upgrade in 2.4. Can you try 
downgrading Parquet and see if the problem goes away?

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> optional binary alternateAllele (UTF8);
> required group effects (LIST) {
>   repeated binary array (UTF8);
> }
> optional binary geneName (UTF8);
> optional binary geneId (UTF8);
> optional binary featureType (UTF8);
> optional binary featureId (UTF8);
> optional binary biotype (UTF8);
> optional int32 rank;
> optional int32 total;
> optional binary genomicHgvs (UTF8);
> optional binary transcriptHgvs (UTF8);
> optional binary proteinHgvs (UTF8);
> optional int32 cdnaPosition;
> optional int32 cdnaLength;
> optional int32 cdsPosition;
> optional int32 cdsLength;
> optional int32 proteinPosition;
> optional int32 proteinLength;
> optional int32 distance;
> required group messages (LIST) {
>   repeated binary array (ENUM);
> }
>   }
> }
> required group attributes (MAP) {
>   repeated group map (MAP_KEY_VALUE) {
> required binary key (UTF8);
> requi

[jira] [Assigned] (SPARK-25760) Set AddJarCommand return empty

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25760:


Assignee: Apache Spark

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25760) Set AddJarCommand return empty

2018-10-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25760:

Description: 
{noformat}
spark-sql> add jar 
/Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
0
spark-sql>{noformat}
{{AddJarCommand}} only returns a 0, which will confuse the user about what it means. 
It should return an empty result.

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>
> {noformat}
> spark-sql> add jar 
> /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
> ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
> 0
> spark-sql>{noformat}
> {{AddJarCommand}} only returns a 0, which will confuse the user about what it 
> means. It should return an empty result.
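
For illustration, a simplified sketch of the proposed behaviour against Spark's internal command API (signatures vary across versions and this is not the actual AddJarCommand source): declare no output attributes and return no rows, so the CLI prints nothing instead of a bare 0.

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.command.RunnableCommand

// Hypothetical command with an empty result set.
case class AddJarLikeCommand(path: String) extends RunnableCommand {
  override val output: Seq[Attribute] = Seq.empty            // no result schema
  override def run(sparkSession: SparkSession): Seq[Row] = {
    sparkSession.sparkContext.addJar(path)                   // side effect only
    Seq.empty                                                // nothing to display
  }
}
{code}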



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25760) Set AddJarCommand return empty

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654437#comment-16654437
 ] 

Apache Spark commented on SPARK-25760:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22747

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25760) Set AddJarCommand return empty

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25760:


Assignee: (was: Apache Spark)

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25760) Set AddJarCommand return empty

2018-10-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25760:
---

 Summary: Set AddJarCommand return empty
 Key: SPARK-25760
 URL: https://issues.apache.org/jira/browse/SPARK-25760
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-17 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654417#comment-16654417
 ] 

Xiao Li commented on SPARK-23390:
-

Thanks!

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite&test_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/]
>  (May 9th)
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] 
> (May 11th)
>  - 
> [https://

[jira] [Assigned] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23390:
---

Assignee: Dongjoon Hyun  (was: Wenchen Fan)

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite&test_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/]
>  (May 9th)
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] 
> (May 11th)
>  - 
> [https://amplab.cs.b

[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-17 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654413#comment-16654413
 ] 

Dongjoon Hyun commented on SPARK-23390:
---

[~zsxwing], [~cloud_fan], and [~smilegator].
Both ORC-416 and ORC-419 are fixed and will be shipped in the next ORC releases, 
1.5.4 and 1.6.0.
I'm still looking into both sides (ORC and Spark) of this issue.

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite&test_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/je

[jira] [Comment Edited] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654400#comment-16654400
 ] 

Michael Heuer edited comment on SPARK-25588 at 10/17/18 11:56 PM:
--

[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]

 

That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
  at 
org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}
 

Removing the pinned version and dependency exclusions, bringing the build 
dependency version to 1.10.0, results in the error reported here in our unit 
tests under Spark version 2.4.0.

[https://github.com/bigdatagenomics/adam/pull/2056]

 

Doing the same thing also results in the error reported here in our unit tests 
under Spark version 2.3.2.

[https://github.com/bigdatagenomics/adam/pull/2055]

 

As mentioned above, I've reported this error against Parquet as 
https://issues.apache.org/jira/browse/PARQUET-1441

 


was (Author: heuermh):
[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]

That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
  at 
org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}
 

Removing the pinned version and dependency exclusions, bringing the build 
dependency version to 1.10.0, results in the error reported here in our unit 
tests under Spark version 2.4.0.

[https://github.com/bigdatagenomics/adam/pull/2056]

 

Doing the same thing also results in the error reported here in our unit tests 
under Spark version 2.3.2.

[https://github.com/bigdatagenomics/adam/pull/2055]

 

As mentioned above, I've reported this error against Parquet as 
https://issues.apache.org/jira/browse/PARQUET-1441

 

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
> 

[jira] [Comment Edited] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654400#comment-16654400
 ] 

Michael Heuer edited comment on SPARK-25588 at 10/17/18 11:55 PM:
--

[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]

That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
  at 
org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}
 

Removing the pinned version and dependency exclusions, bringing the build 
dependency version to 1.10.0, results in the error reported here in our unit 
tests under Spark version 2.4.0.

[https://github.com/bigdatagenomics/adam/pull/2056]

 

Doing the same thing also results in the error reported here in our unit tests 
under Spark version 2.3.2.

[https://github.com/bigdatagenomics/adam/pull/2055]

 

As mentioned above, I've reported this error against Parquet as 
https://issues.apache.org/jira/browse/PARQUET-1441

 


was (Author: heuermh):
[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]


That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
  at 
org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}

Removing the pinned version and dependency exclusions, bringing the build 
dependency version to 1.10.0, results in the error reported here in our unit 
tests under Spark version 2.4.0.  Doing the same thing also results in the 
error reported here in our unit tests under Spark version 2.3.2.

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgeno

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654400#comment-16654400
 ] 

Michael Heuer commented on SPARK-25588:
---

[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]


That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
  at 
org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}

Removing the pinned version and dependency exclusions, bringing the build 
dependency version to 1.10.0, results in the error reported here in our unit 
tests under Spark version 2.4.0.  Doing the same thing also results in the 
error reported here in our unit tests under Spark version 2.3.2.

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
>Reporter: Michael Heuer
>Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
> val reads = sc.loadAlignments(testFile("small.sam"))
> def checkSave(variants: VariantRDD) {
>   val tempPath = tmpLocation(".adam")
>   variants.saveAsParquet(tempPath)
>   assert(sc.loadVariants(tempPath).rdd.count === 20)
> }
> val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
> VariantRDD](
>   (rdd: RDD[AlignmentRecord]) => {
> rdd.map(AlignmentRecordRDDSuite.varFn)
>   })
> checkSave(variants)
> val sqlContext = SQLContext.getOrCreate(sc)
> import sqlContext.implicits._
> val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
> VariantProduct, VariantRDD](
>   (ds: Dataset[AlignmentRecordProduct]) => {
> ds.map(r => {
>   VariantProduct.fromAvro(
> AlignmentRecordRDDSuite.varFn(r.toAvro))
> })
>   })
> checkSave(variantsDs)
> }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema 
> /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
> repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
> repeated binary array (UTF8);
>   }
>   optional group annotation {
> optional binary ancestralAllele (UTF8);
> optional int32 alleleCount;
> optional int32 readDepth;
> optional int32 forwardReadDepth;
> optional int32 reverseReadDepth;
> optional int32 referenceReadDepth;
> optional int32 referenceForwardReadDepth;
> optional int32 referenceReverseReadDepth;
> optional float alleleFrequency;
> optional binary cigar (UTF8);
> optional boolean dbSnp;
> optional boolean hapMap2;
> optional boolean hapMap3;
> optional boolean validated;
> optional boolean thousandGenomes;
> optional boolean somatic;
> required group transcriptEffects (LIST) {
>   repeated group array {
> 

[jira] [Assigned] (SPARK-25751) Unit Testing for Kerberos Support for Spark on Kubernetes

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25751:


Assignee: (was: Apache Spark)

> Unit Testing for Kerberos Support for Spark on Kubernetes
> -
>
> Key: SPARK-25751
> URL: https://issues.apache.org/jira/browse/SPARK-25751
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Unit tests for Kerberos Support within Spark on Kubernetes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25751) Unit Testing for Kerberos Support for Spark on Kubernetes

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654312#comment-16654312
 ] 

Apache Spark commented on SPARK-25751:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22760

> Unit Testing for Kerberos Support for Spark on Kubernetes
> -
>
> Key: SPARK-25751
> URL: https://issues.apache.org/jira/browse/SPARK-25751
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Unit tests for Kerberos Support within Spark on Kubernetes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25751) Unit Testing for Kerberos Support for Spark on Kubernetes

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25751:


Assignee: Apache Spark

> Unit Testing for Kerberos Support for Spark on Kubernetes
> -
>
> Key: SPARK-25751
> URL: https://issues.apache.org/jira/browse/SPARK-25751
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> Unit tests for Kerberos Support within Spark on Kubernetes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654170#comment-16654170
 ] 

Apache Spark commented on SPARK-25332:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/22758

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, or do spark-submit, or restart JDBCServer again, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as).
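For context, a minimal sketch of the datasource-table variant that the reporter says works (table names here are illustrative, not taken from the report):
{code:scala}
// Illustrative only: create the tables as datasource tables ("USING parquet")
// instead of Hive-provider tables ("STORED AS parquet").
spark.sql("create table y1 (name string, age int) using parquet")
spark.sql("create table y2 (name string, age int) using parquet")
spark.sql("insert into y1 select 'a', 29")
spark.sql("insert into y2 select 'a', 29")
// Per the report, this variant keeps choosing BroadcastHashJoin even after
// restarting spark-shell / the JDBC server.
spark.sql("select * from y1 t1, y2 t2 where t1.name = t2.name").explain()
{code}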



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25332:


Assignee: Apache Spark

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Assignee: Apache Spark
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, or do spark-submit, or restart JDBCServer again, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25332:


Assignee: (was: Apache Spark)

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, or do spark-submit, or restart JDBCServer again, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654168#comment-16654168
 ] 

Apache Spark commented on SPARK-25332:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/22758

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, or do spark-submit, or restart JDBCServer again, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2018-10-17 Thread Bansal, Parvesh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654072#comment-16654072
 ] 

Bansal, Parvesh commented on SPARK-23015:
-

Hi Hugh Zabriskie,

I have been following this issue, as I am facing the same challenge while 
working in a Windows environment (using Python threads to launch multiple 
spark-submit jobs).
Can you please let me know the current resolution or workaround for Windows, 
and whether it is fixed for the Linux environment?

Please suggest.

Thanks and Regards
Parvesh K Bansal
--


> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.
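As an aside, one collision-free naming mechanism is to let the JVM/OS allocate the file; the sketch below is only illustrative (spark-class2.cmd is a Windows batch script, so an actual fix would need a batch-level equivalent):
{code:scala}
import java.nio.file.Files

// Sketch of a collision-free alternative to %RANDOM%: ask the platform for a
// unique temporary file instead of composing the name from a 0-32767 random.
val launcherOutput = Files.createTempFile("spark-class-launcher-output-", ".txt")
println(launcherOutput)
{code}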



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24601) Bump Jackson version to 2.9.6

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653772#comment-16653772
 ] 

Apache Spark commented on SPARK-24601:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22757

> Bump Jackson version to 2.9.6
> -
>
> Key: SPARK-24601
> URL: https://issues.apache.org/jira/browse/SPARK-24601
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
> The Jackson version is lagging behind, and therefore I have to add a lot of 
> exclusions to the SBT files: 
> ```
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible 
> Jackson version: 2.9.5
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:751)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
> ```
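For anyone hitting the same mismatch in a downstream sbt build, a hedged sketch of the kind of override the reporter alludes to (the versions here are assumptions, and this is not part of the Spark build itself):
{code:scala}
// Hypothetical downstream workaround: force one consistent Jackson version
// across spark-core and jackson-module-scala via sbt's dependencyOverrides.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"   % "jackson-databind"      % "2.9.6",
  "com.fasterxml.jackson.core"   % "jackson-core"          % "2.9.6",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.6"
)
{code}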



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25759) StreamingListenerBus: Listener JavaStreamingListenerWrapper threw an exception

2018-10-17 Thread Parhy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parhy updated SPARK-25759:
--
Affects Version/s: (was: 2.2.0)
   2.2.1

> StreamingListenerBus: Listener JavaStreamingListenerWrapper threw an exception
> --
>
> Key: SPARK-25759
> URL: https://issues.apache.org/jira/browse/SPARK-25759
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Any
>Reporter: Parhy
>Priority: Blocker
>
> I am using PySpark with Spark Streaming version 2.2.1, with AWS S3 as my 
> source.
> Once a batch is complete, I would like to remove the processed files from S3, 
> so I have extended the StreamingListener class. I can see the listener is 
> called once the batch is complete, but I am also getting an exception. I did 
> see a Stack Overflow question with the exact same problem but no solution.
> Kindly help here.
> Below is the exception. 
>  
> ERROR StreamingListenerBus: Listener JavaStreamingListenerWrapper threw an 
> exception
> py4j.Py4JException: An exception was raised by the Python Proxy. Return 
> Message: x
>  at py4j.Protocol.getReturnValue(Protocol.java:438)
>  at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:105)
>  at com.sun.proxy.$Proxy10.onBatchCompleted(Unknown Source)
>  at 
> org.apache.spark.streaming.api.java.PythonStreamingListenerWrapper.onBatchCompleted(JavaStreamingListener.scala:89)
>  at 
> org.apache.spark.streaming.api.java.JavaStreamingListenerWrapper.onBatchCompleted(JavaStreamingListenerWrapper.scala:111)
>  at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:63)
>  at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
>  at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>  at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
>  at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
>  at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:75)
>  at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>  at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>  at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>  at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1279)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
>  
> Below is the link for SO
> https://stackoverflow.com/questions/47780794/py4jexception-using-pyspark-streaminglistener/52858375#52858375
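For reference, the hook being discussed looks like this on the Scala side (a sketch only; the reporter is using the PySpark wrapper, and the cleanup body is a placeholder):
{code:scala}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Illustrative listener: any exception thrown from onBatchCompleted is what the
// listener bus reports as "Listener ... threw an exception".
class CleanupListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    // Placeholder: delete the processed S3 objects here, and catch failures so
    // they do not propagate back to the listener bus.
  }
}
{code}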



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25759) StreamingListenerBus: Listener JavaStreamingListenerWrapper threw an exception

2018-10-17 Thread Parhy (JIRA)
Parhy created SPARK-25759:
-

 Summary: StreamingListenerBus: Listener 
JavaStreamingListenerWrapper threw an exception
 Key: SPARK-25759
 URL: https://issues.apache.org/jira/browse/SPARK-25759
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Any
Reporter: Parhy


I am using PySpark with Spark Streaming version 2.2.1, with AWS S3 as my 
source.

Once a batch is complete, I would like to remove the processed files from S3, so 
I have extended the StreamingListener class. I can see the listener is called 
once the batch is complete, but I am also getting an exception. I did see a 
Stack Overflow question with the exact same problem but no solution.

Kindly help here.

Below is the exception. 

 

ERROR StreamingListenerBus: Listener JavaStreamingListenerWrapper threw an 
exception
py4j.Py4JException: An exception was raised by the Python Proxy. Return 
Message: x
 at py4j.Protocol.getReturnValue(Protocol.java:438)
 at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:105)
 at com.sun.proxy.$Proxy10.onBatchCompleted(Unknown Source)
 at 
org.apache.spark.streaming.api.java.PythonStreamingListenerWrapper.onBatchCompleted(JavaStreamingListener.scala:89)
 at 
org.apache.spark.streaming.api.java.JavaStreamingListenerWrapper.onBatchCompleted(JavaStreamingListenerWrapper.scala:111)
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:63)
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
 at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
 at 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:75)
 at 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
 at 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
 at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
 at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
 at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
 at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
 at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
 at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
 at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1279)
 at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)

 

Below is the link for SO

https://stackoverflow.com/questions/47780794/py4jexception-using-pyspark-streaminglistener/52858375#52858375



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21402:
---

Assignee: Vladimir Kuriatkov

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
> Fix For: 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all through the way before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> |                1498854000|              1498870800|
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.
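As a stopgap until the fix lands, one possible bypass is to skip the bean encoder for the nested struct and read the fields by name through the Row API (a sketch only, written in Scala although the report uses the Java API; the column and field names are taken from the report):
{code:scala}
import org.apache.spark.sql.Row

// Illustrative bypass: access the map values as Rows and look fields up by
// name, so nothing can be swapped positionally by the bean encoder.
val row = raw_df.collect().head
val data = row.getMap[String, Row](row.fieldIndex("data"))
data.get("2017-07-01").foreach { dto =>
  println("startTime: " + dto.getAs[Long]("startTime"))
  println("endTime: " + dto.getAs[Long]("endTime"))
}
{code}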



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-10-17 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653750#comment-16653750
 ] 

Imran Rashid commented on SPARK-20327:
--

https://github.com/apache/spark/pull/22751 was merged as well as a followup

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie
> Fix For: 3.0.0
>
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21402.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22708
[https://github.com/apache/spark/pull/22708]

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
> Fix For: 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all through the way before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> |                1498854000|              1498870800|
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25758:


Assignee: Apache Spark

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> have now a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, in the deprecation, the method was 
> targeted for removal in 3.0.
> I think we should deprecate the computeCost method on BisectingKMeans  for 
> the same reasons.
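For readers who have not used the replacement, a minimal sketch of evaluating a BisectingKMeans model with ClusteringEvaluator instead of computeCost (the toy data and the `spark` session name are assumptions, as in spark-shell):
{code:scala}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors

// Toy data: two well-separated groups of points in a "features" column.
val dataset = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 0.0), 0),
  (Vectors.dense(0.1, 0.1), 0),
  (Vectors.dense(9.0, 9.0), 1),
  (Vectors.dense(9.1, 9.1), 1)
)).toDF("features", "label")

val model = new BisectingKMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// ClusteringEvaluator computes the silhouette score (higher is better),
// which is the suggested replacement for the deprecated computeCost.
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
{code}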



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25758:


Assignee: (was: Apache Spark)

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> have now a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, in the deprecation, the method was 
> targeted for removal in 3.0.
> I think we should deprecate the computeCost method on BisectingKMeans  for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653709#comment-16653709
 ] 

Apache Spark commented on SPARK-25758:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22756

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> have now a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, in the deprecation, the method was 
> targeted for removal in 3.0.
> I think we should deprecate the computeCost method on BisectingKMeans  for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25735.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22727
[https://github.com/apache/spark/pull/22727]

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After fix, the usage output is clean:
> Thrift server options:
> --hiveconf <property=value>  Use value for given property
> Also exit with code 1, to follow other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25735) Improve start-thriftserver.sh: print clean usage and exit with code 1

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25735:
-

Assignee: Gengliang Wang

> Improve start-thriftserver.sh: print clean usage and exit with code 1
> -
>
> Key: SPARK-25735
> URL: https://issues.apache.org/jira/browse/SPARK-25735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently if we run 
> sh start-thriftserver.sh -h
> we get 
> ...
> Thrift server options:
> 2018-10-15 21:45:39 INFO  HiveThriftServer2:54 - Starting SparkContext
> 2018-10-15 21:45:40 INFO  SparkContext:54 - Running Spark version 2.3.2
> 2018-10-15 21:45:40 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext.
> org.apache.spark.SparkException: A master URL must be set in your 
> configuration
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:367)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main
> After the fix, the usage output is clean:
> Thrift server options:
> --hiveconf   Use value for given property
> Also, exit with code 1 to follow the other scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25741:
-

Assignee: Gengliang Wang

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> When the URL in the description column of the table on the job/stage page is 
> long, the Web UI doesn't render it properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25741.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22744
[https://github.com/apache/spark/pull/22744]

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> When the URL in the description column of the table on the job/stage page is 
> long, the Web UI doesn't render it properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25758:
--
Priority: Minor  (was: Major)

(Not major). I agree with deprecating it. It does not need to block 2.4.

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, in the deprecation, the method was 
> targeted for removal in 3.0.
> I think we should deprecate the computeCost method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653641#comment-16653641
 ] 

Marco Gaido commented on SPARK-25758:
-

cc [~cloud_fan] [~srowen] [~holdenkarau]. This is a minor thing but might 
become a blocker for 2.4 if we want to deprecate it there so that we can remove 
it in 3.0 as planned for {{KMeans.computeCost}}.

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, in the deprecation, the method was 
> targeted for removal in 3.0.
> I think we should deprecate the computeCost method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-17 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-25758:
---

 Summary: Deprecate BisectingKMeans compute cost
 Key: SPARK-25758
 URL: https://issues.apache.org/jira/browse/SPARK-25758
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: Marco Gaido


In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
now have a better way to evaluate a clustering algorithm (the 
{{ClusteringEvaluator}}). Moreover, in the deprecation, the method was targeted 
for removal in 3.0.

I think we should deprecate the computeCost method on BisectingKMeans for the 
same reasons.
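
A minimal sketch of the suggested replacement, using {{ClusteringEvaluator}} (the {{dataset}} DataFrame and the parameter values here are placeholders, not taken from this issue):

{code:java}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Fit a bisecting k-means model on a DataFrame with a "features" column.
val model = new BisectingKMeans().setK(3).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Instead of model.computeCost(dataset), evaluate the clustering with the
// silhouette metric from ClusteringEvaluator.
val silhouette = new ClusteringEvaluator().evaluate(predictions)
{code}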



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Brian Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653623#comment-16653623
 ] 

Brian Jones commented on SPARK-25739:
-

[~hyukjin.kwon] Yes, I agree, that is a better solution than adding a new 
feature.

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653618#comment-16653618
 ] 

Hyukjin Kwon commented on SPARK-25739:
--

If so, we should identify the patch introducing that behaviour change and revert 
it in 2.3.x, rather than porting the new feature.

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Brian Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653617#comment-16653617
 ] 

Brian Jones commented on SPARK-25739:
-

[~hyukjin.kwon] Correct. The code I gave produces the expected output in 2.3.0 and 
2.4.0 but not in 2.3.1, which doesn't make any sense to me.

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653615#comment-16653615
 ] 

Hyukjin Kwon commented on SPARK-25739:
--

Oh, you mean it does work on Spark 2.3.0 but not on Spark 2.3.1?

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Brian Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653604#comment-16653604
 ] 

Brian Jones commented on SPARK-25739:
-

[~hyukjin.kwon] Alright. But then why is it working on 2.3?

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653599#comment-16653599
 ] 

Hyukjin Kwon commented on SPARK-25739:
--

New features are not backported in maintenance releases, as doing so could 
break other users' apps that are on Spark 2.3.x.

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Brian Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653595#comment-16653595
 ] 

Brian Jones commented on SPARK-25739:
-

[~hyukjin.kwon] Correct. However, we are not able to upgrade to Spark 2.4; we 
must use 2.3.1. It works fine on 2.3 and 2.4, only 2.3.1 is not 
working. 

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are stilling coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25757) Upgrade netty-all from 4.1.17.Final to 4.1.30.Final

2018-10-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25757:

Affects Version/s: (was: 2.3.2)
   3.0.0

> Upgrade netty-all from 4.1.17.Final to 4.1.30.Final
> ---
>
> Key: SPARK-25757
> URL: https://issues.apache.org/jira/browse/SPARK-25757
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Upgrade netty from 4.1.17.Final to 4.1.30.Final to fix some netty version 
> bugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25757) Upgrade netty-all from 4.1.17.Final to 4.1.30.Final

2018-10-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25757:

Component/s: (was: Deploy)
 Build

> Upgrade netty-all from 4.1.17.Final to 4.1.30.Final
> ---
>
> Key: SPARK-25757
> URL: https://issues.apache.org/jira/browse/SPARK-25757
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Upgrade netty from 4.1.17.Final to 4.1.30.Final to fix some netty version 
> bugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25757) Upgrade netty-all from 4.1.17.Final to 4.1.30.Final

2018-10-17 Thread Zhu, Lipeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653460#comment-16653460
 ] 

Zhu, Lipeng commented on SPARK-25757:
-

I am working on this.

> Upgrade netty-all from 4.1.17.Final to 4.1.30.Final
> ---
>
> Key: SPARK-25757
> URL: https://issues.apache.org/jira/browse/SPARK-25757
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Upgrade netty from 4.1.17.Final to 4.1.30.Final to fix some netty version 
> bugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25757) Upgrade netty-all from 4.1.17.Final to 4.1.30.Final

2018-10-17 Thread Zhu Lipeng (JIRA)
Zhu Lipeng created SPARK-25757:
--

 Summary: Upgrade netty-all from 4.1.17.Final to 4.1.30.Final
 Key: SPARK-25757
 URL: https://issues.apache.org/jira/browse/SPARK-25757
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Deploy
Affects Versions: 2.3.2
Reporter: Zhu Lipeng


Upgrade netty from 4.1.17.Final to 4.1.30.Final to fix some netty version bugs.
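
As a minimal illustration only (not the build change proposed here), an application that wants to pin the newer Netty before this upgrade lands could override the dependency in sbt:

{code:java}
// Force netty-all 4.1.30.Final on the application classpath (sbt build definition).
dependencyOverrides += "io.netty" % "netty-all" % "4.1.30.Final"
{code}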



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25754) Change CDN for MathJax

2018-10-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25754.
---
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0
   2.3.3

Resolved by https://github.com/apache/spark/pull/22753

> Change CDN for MathJax 
> ---
>
> Key: SPARK-25754
> URL: https://issues.apache.org/jira/browse/SPARK-25754
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 2.3.3, 2.4.0
>
>
> Currently when we open our doc site: 
> https://spark.apache.org/docs/latest/index.html , there is one warning 
> WARNING: cdn.mathjax.org has been retired. Check 
> https://www.mathjax.org/cdn-shutting-down/ for migration tips.
> Change the CDN as per the migration tips.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25756) pyspark pandas_udf does not respect append outputMode in structured streaming

2018-10-17 Thread Jan Bols (JIRA)
Jan Bols created SPARK-25756:


 Summary: pyspark pandas_udf does not respect append outputMode in 
structured streaming
 Key: SPARK-25756
 URL: https://issues.apache.org/jira/browse/SPARK-25756
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 2.3.2
Reporter: Jan Bols


When using the following setup:
 * structured streaming
 * a watermark and groupBy followed by an apply using a pandas grouped map udf
 * a sink using an append outputMode

I would expect the following:
 * the udf to be called for each group --> OK
 * when new data arrives, the udf will be called again --> OK
 * when new data arrives for the same group, the udf will be called with the 
complete pandas dataframe of all data received for that group (up to the 
watermark) --> NOK: within the same group, the size of the pandas dataframe can 
decrease between invocations
 * the results are only written to the sink once processing time has passed 
the watermark --> NOK: every time the udf is called, new results are sent 
to the output

It looks like pandas_udf is unusable for Structured Streaming this way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25749:
-
Priority: Major  (was: Blocker)

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Major
> Attachments: EncoderExample.scala, MainCC.scala, build.sbt, exception
>
>
> Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs 
> reads an Avro source that has a large nested schema. The job fails on Spark 
> 2.3.1 (I have also tested Spark 2.3.0 & Spark 2.3.2, and the job fails there 
> too). I am able to replicate this with some sample data + a dummy case class. 
> Please find attached:
> *Code*: EncoderExample.scala, MainCC.scala & build.sbt
> *Exception log*: exception
> PS:
> I am getting the exception \{{java.lang.OutOfMemoryError: Java heap space}}. I 
> have tried increasing the JVM size in Eclipse, but that does not help either. 
> I have also tested the code in Spark 2.2.2 and it works fine. It seems this 
> bug was introduced in Spark 2.3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25749.
--
Resolution: Invalid

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Major
> Attachments: EncoderExample.scala, MainCC.scala, build.sbt, exception
>
>
> Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs 
> reads an Avro source that has a large nested schema. The job fails on Spark 
> 2.3.1 (I have also tested Spark 2.3.0 & Spark 2.3.2, and the job fails there 
> too). I am able to replicate this with some sample data + a dummy case class. 
> Please find attached:
> *Code*: EncoderExample.scala, MainCC.scala & build.sbt
> *Exception log*: exception
> PS:
> I am getting the exception \{{java.lang.OutOfMemoryError: Java heap space}}. I 
> have tried increasing the JVM size in Eclipse, but that does not help either. 
> I have also tested the code in Spark 2.2.2 and it works fine. It seems this 
> bug was introduced in Spark 2.3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653209#comment-16653209
 ] 

Hyukjin Kwon commented on SPARK-25749:
--

Please avoid setting the priority to Critical+, which is usually reserved for 
committers. As for the issue itself, as of Spark 2.3.x Avro is an external data 
source, so the issue is better asked at databricks/spark-avro.

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Blocker
> Attachments: EncoderExample.scala, MainCC.scala, build.sbt, exception
>
>
> Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs 
> reads an Avro source that has a large nested schema. The job fails on Spark 
> 2.3.1 (I have also tested Spark 2.3.0 & Spark 2.3.2, and the job fails there 
> too). I am able to replicate this with some sample data + a dummy case class. 
> Please find attached:
> *Code*: EncoderExample.scala, MainCC.scala & build.sbt
> *Exception log*: exception
> PS:
> I am getting the exception \{{java.lang.OutOfMemoryError: Java heap space}}. I 
> have tried increasing the JVM size in Eclipse, but that does not help either. 
> I have also tested the code in Spark 2.2.2 and it works fine. It seems this 
> bug was introduced in Spark 2.3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25739.
--
Resolution: Duplicate

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25731.
--
Resolution: Duplicate

> Spark Structured Streaming Support for Kafka 2.0
> 
>
> Key: SPARK-25731
> URL: https://issues.apache.org/jira/browse/SPARK-25731
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Chandan
>Priority: Major
>  Labels: beginner, features
>
> [https://github.com/apache/spark/tree/master/external]
> As far as I can see, 
>  This doesn't have support for newly release *kafka2.0,*
>  support is available only till *kafka-0-10.*
> If we use the 
> "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
> for kafka2.0, below is the error I get
> 11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Completed connection to node -1. Fetching API versions.*
>  11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Initiating API versions fetch from node -1.*
>  11:46:18.452 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 
> disconnected*
>  *java.io.EOFException: null*
>  at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
>  at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
>  at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
>  
>  
> I might be wrong, but opening an issue seemed like the best option. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25733.
--
Resolution: Duplicate

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf as blew:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used is attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after the data loaded, I executed "df.take(1)" to test the dataframe, and the 
> expected output was printed:
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> for testing, using the code below:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But no output is printed after that code is 
> executed.*{color}
>  
> Then I execute "df.take(1)" and the below error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 1863, in showtraceback
> stb = value._render_traceback_()
> AttributeError: 'Py4JError' object has

[jira] [Resolved] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?

2018-10-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25742.
--
Resolution: Invalid

Questions should go to the mailing list. Please see 
https://spark.apache.org/community.html

> Is there a way to pass the Azure blob storage credentials to the spark for 
> k8s init-container?
> --
>
> Key: SPARK-25742
> URL: https://issues.apache.org/jira/browse/SPARK-25742
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes
>Affects Versions: 2.3.2
>Reporter: Oscar Bonilla
>Priority: Minor
>
> I'm trying to run spark on a kubernetes cluster in Azure. The idea is to 
> store the Spark application jars and dependencies in a container in Azure 
> Blob Storage.
> I've tried to do this with a public container and this works OK, but when 
> having a private Blob Storage container, the spark-init init container 
> doesn't download the jars.
> The equivalent in AWS S3 is as simple as adding the key_id and secret as 
> environment variables, but I don't see how to do this for Azure Blob Storage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653203#comment-16653203
 ] 

Hyukjin Kwon commented on SPARK-25739:
--

So this is fixed in Spark 2.4, right? That option was added in Spark 2.4. See 
SPARK-25241.
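
For reference, a minimal sketch of the 2.4 behaviour with the write-side {{emptyValue}} option added in SPARK-25241 (the output path is a placeholder, and it assumes a shell or notebook session where {{spark.implicits._}} is in scope):

{code:java}
val df = List((1, ""), (2, "hello"), (3, "hi"), (4, null)).toDF("key", "value")

// In 2.4+, empty strings are written as "" by default; setting emptyValue to an
// empty (unquoted) string should restore the pre-2.4 output, where nothing is written.
df.repartition(1)
  .write
  .mode("overwrite")
  .option("emptyValue", "")
  .format("csv")
  .save("/tmp/nullcsv/")
{code}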

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming as null.  Even 
> then, I am passing the correct "emptyValue" option.  However, my empty values 
> are still coming as `""` in the written file.
>  
> I have tested the provided code in Databricks runtime environment 5.0 and 
> 4.1, and it is giving the expected output.   However in Databricks runtime 
> 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25731) Spark Structured Streaming Support for Kafka 2.0

2018-10-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653200#comment-16653200
 ] 

Hyukjin Kwon commented on SPARK-25731:
--

duplicate of SPARK-18057
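
As a side note, the Structured Streaming Kafka source is published as a separate artifact from the DStreams connector referenced in the report; a minimal sbt sketch (the version shown is only an example):

{code:java}
// Structured Streaming Kafka source/sink, distinct from spark-streaming-kafka-0-10.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.2"
{code}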

> Spark Structured Streaming Support for Kafka 2.0
> 
>
> Key: SPARK-25731
> URL: https://issues.apache.org/jira/browse/SPARK-25731
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Chandan
>Priority: Major
>  Labels: beginner, features
>
> [https://github.com/apache/spark/tree/master/external]
> As far as I can see, 
>  This doesn't have support for newly release *kafka2.0,*
>  support is available only till *kafka-0-10.*
> If we use the 
> "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
> for kafka2.0, below is the error I get
> 11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Completed connection to node -1. Fetching API versions.*
>  11:46:18.061 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  *Initiating API versions fetch from node -1.*
>  11:46:18.452 [stream execution thread for [id = 
> e393ea37-8009-4ce0-b996-94f767994fb8, runId = 
> bc15eb7d-876d-4e01-8ee5-22205ec7fdcb]] DEBUG 
> org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-2, 
> groupId=spark-kafka-source-8ce7f26f-e342-4b0d-85f1-a9f641b79629-1052905425-driver-0]
>  Connection with *kafka-muhammad-45e0.aivencloud.com/18.203.67.147 
> disconnected*
>  *java.io.EOFException: null*
>  at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
>  at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
>  at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
>  
>  
> I might be wrong, but opening an issue seemed like the best option. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25680) SQL execution listener shouldn't happen on execution thread

2018-10-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25680.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22674
[https://github.com/apache/spark/pull/22674]

> SQL execution listener shouldn't happen on execution thread
> ---
>
> Key: SPARK-25680
> URL: https://issues.apache.org/jira/browse/SPARK-25680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25755) Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653145#comment-16653145
 ] 

Apache Spark commented on SPARK-25755:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22755

> Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec
> 
>
> Key: SPARK-25755
> URL: https://issues.apache.org/jira/browse/SPARK-25755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec physical plan supports both codegen and 
> non-codegen execution, but only the codegen path is tested in the unit tests of 
> InnerJoinSuite, OuterJoinSuite and ExistenceJoinSuite; the non-codegen path is 
> not tested. This PR supplements that part of the tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25755) Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25755:


Assignee: Apache Spark

> Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec
> 
>
> Key: SPARK-25755
> URL: https://issues.apache.org/jira/browse/SPARK-25755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec physical plan supports both codegen and 
> non-codegen execution, but only the codegen path is tested in the unit tests of 
> InnerJoinSuite, OuterJoinSuite and ExistenceJoinSuite; the non-codegen path is 
> not tested. This PR supplements that part of the tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25755) Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25755:


Assignee: (was: Apache Spark)

> Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec
> 
>
> Key: SPARK-25755
> URL: https://issues.apache.org/jira/browse/SPARK-25755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec physical plan supports both codegen and 
> non-codegen execution, but only the codegen path is tested in the unit tests of 
> InnerJoinSuite, OuterJoinSuite and ExistenceJoinSuite; the non-codegen path is 
> not tested. This PR supplements that part of the tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25755) Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-17 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25755:
-

 Summary: Supplementation of non-CodeGen unit tested for 
BroadcastHashJoinExec
 Key: SPARK-25755
 URL: https://issues.apache.org/jira/browse/SPARK-25755
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


Currently, the BroadcastHashJoinExec physical plan has both a CodeGen and a 
non-CodeGen execution path, but only the CodeGen path is exercised by the 
unit tests in InnerJoinSuite, OuterJoinSuite, and ExistenceJoinSuite; the 
non-CodeGen path is untested. This PR supplements the tests to cover it.
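
For context, a minimal, hypothetical sketch of what exercising both paths can 
look like (this is not the suite code added by the PR): the same broadcast 
join is run once with whole-stage codegen enabled and once with it disabled 
via spark.sql.codegen.wholeStage, and the results are compared. The object 
and variable names below are illustrative only.

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Illustrative check: the broadcast hash join should return the same rows
// whether BroadcastHashJoinExec runs through generated code or the
// interpreted (non-codegen) path.
object NonCodegenBroadcastJoinCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("NonCodegenBroadcastJoinCheck")
      .getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // broadcast() hints the planner toward BroadcastHashJoinExec.
    def runJoin(): Array[Row] =
      left.join(broadcast(right), "id").orderBy("id").collect()

    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    val withCodegen = runJoin()

    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    val withoutCodegen = runJoin()

    assert(withCodegen.sameElements(withoutCodegen),
      "codegen and non-codegen broadcast hash join results should match")
    spark.stop()
  }
}
{code}

In Spark's own test suites this toggling is usually done through shared test 
helpers rather than a standalone program, but the underlying idea is the same.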



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25754) Change CDN for MathJax

2018-10-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653081#comment-16653081
 ] 

Apache Spark commented on SPARK-25754:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22753

> Change CDN for MathJax 
> ---
>
> Key: SPARK-25754
> URL: https://issues.apache.org/jira/browse/SPARK-25754
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> Currently, opening our doc site 
> https://spark.apache.org/docs/latest/index.html shows a warning: 
> "WARNING: cdn.mathjax.org has been retired. Check 
> https://www.mathjax.org/cdn-shutting-down/ for migration tips."
> Change the CDN as per the migration tips.
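
For illustration only, the kind of edit the migration tips call for, assuming 
the docs layout loads MathJax 2.x through a plain script tag and picking the 
cdnjs mirror as the example replacement; the actual file, version, and config 
string in the Spark docs may differ.

{code:html}
<!-- Before: the retired CDN (illustrative, not the exact line in the Spark docs) -->
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

<!-- After: one of the mirrors listed in the migration tips (cdnjs shown here) -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
{code}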



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25754) Change CDN for MathJax

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25754:


Assignee: (was: Apache Spark)

> Change CDN for MathJax 
> ---
>
> Key: SPARK-25754
> URL: https://issues.apache.org/jira/browse/SPARK-25754
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> Currently, opening our doc site 
> https://spark.apache.org/docs/latest/index.html shows a warning: 
> "WARNING: cdn.mathjax.org has been retired. Check 
> https://www.mathjax.org/cdn-shutting-down/ for migration tips."
> Change the CDN as per the migration tips.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25754) Change CDN for MathJax

2018-10-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25754:


Assignee: Apache Spark

> Change CDN for MathJax 
> ---
>
> Key: SPARK-25754
> URL: https://issues.apache.org/jira/browse/SPARK-25754
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, opening our doc site 
> https://spark.apache.org/docs/latest/index.html shows a warning: 
> "WARNING: cdn.mathjax.org has been retired. Check 
> https://www.mathjax.org/cdn-shutting-down/ for migration tips."
> Change the CDN as per the migration tips.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org