[jira] [Commented] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355097#comment-16355097
 ] 

Apache Spark commented on SPARK-23271:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/20525

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> A corner case, reproduced only when an empty CSV file without a header is read 
> with an explicitly assigned schema.
> Steps to reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> val actualDF = spark.read.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
> {code}
> *Actual:* only a "_SUCCESS" file in the output directory.
> *Expected:* at least one Parquet file with the schema.
> A project that reproduces the issue is attached.
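A minimal sketch of a possible workaround, assuming the output directory exists but holds only the "_SUCCESS" marker: supplying the schema explicitly when reading back should skip Parquet schema inference, avoiding the AnalysisException above (the result is still an empty DataFrame).

{code:scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same schema and folder name as in the report above (placeholders).
val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)

// A user-supplied schema means no inference is attempted over the (empty)
// Parquet output, so the read succeeds and simply yields zero rows.
val actualDF = spark.read.schema(anySchema).parquet(outputFolderName)
actualDF.show()
{code}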






[jira] [Assigned] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23271:


Assignee: (was: Apache Spark)

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> A corner case, reproduced only when an empty CSV file without a header is read 
> with an explicitly assigned schema.
> Steps to reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> val actualDF = spark.read.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
> {code}
> *Actual:* only a "_SUCCESS" file in the output directory.
> *Expected:* at least one Parquet file with the schema.
> A project that reproduces the issue is attached.






[jira] [Assigned] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23271:


Assignee: Apache Spark

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Assignee: Apache Spark
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> A corner case, reproduced only when an empty CSV file without a header is read 
> with an explicitly assigned schema.
> Steps to reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> val actualDF = spark.read.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
> {code}
> *Actual:* only a "_SUCCESS" file in the output directory.
> *Expected:* at least one Parquet file with the schema.
> A project that reproduces the issue is attached.






[jira] [Updated] (SPARK-21007) Add SQL function - RIGHT && LEFT

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21007:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-20746

> Add  SQL function - RIGHT && LEFT
> -
>
> Key: SPARK-21007
> URL: https://issues.apache.org/jira/browse/SPARK-21007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 2.3.0
>
>
> Add SQL functions RIGHT and LEFT, same as in MySQL:
>  https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left
> https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right
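A minimal usage sketch, assuming the MySQL semantics linked above and reusing the example string from the MySQL documentation:

{code:scala}
// LEFT(str, n) returns the leftmost n characters; RIGHT(str, n) the rightmost n.
spark.sql("SELECT LEFT('foobarbar', 5) AS l, RIGHT('foobarbar', 4) AS r").show()
// l = "fooba", r = "rbar"
{code}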






[jira] [Updated] (SPARK-14878) Support Trim characters in the string trim function

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14878:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-20746

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>Assignee: kevin yu
>Priority: Major
> Fix For: 2.3.0
>
>
> Spark SQL currently does not support specifying trim characters in the string trim 
> function, which is part of the ANSI SQL:2003 standard. For example, IBM DB2 fully 
> supports it, as shown in 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
> We propose to implement it in this JIRA.
> The ANSI SQL:2003 trim syntax:
> {noformat}
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::=
>   LEADING
> | TRAILING
> | BOTH
> <trim character> ::= <character value expression>
> {noformat}
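A minimal usage sketch of the syntax above, assuming the ANSI trim-character semantics land as described (literals are illustrative):

{code:scala}
// Trim a specified character instead of the default space character.
spark.sql("SELECT TRIM(LEADING '0' FROM '00012300') AS a, TRIM(BOTH 'x' FROM 'xxabcxx') AS b").show()
// a = "12300", b = "abc"
{code}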






[jira] [Deleted] (SPARK-20752) Build-in SQL Function Support - SQRT

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li deleted SPARK-20752:



> Build-in SQL Function Support - SQRT
> 
>
> Key: SPARK-20752
> URL: https://issues.apache.org/jira/browse/SPARK-20752
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
> SQRT()
> {noformat}
> Returns Power(, 2)






[jira] [Updated] (SPARK-22829) Add new built-in function date_trunc()

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22829:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-20746

> Add new built-in function date_trunc()
> --
>
> Key: SPARK-22829
> URL: https://issues.apache.org/jira/browse/SPARK-22829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Youngbin Kim
>Assignee: Youngbin Kim
>Priority: Major
> Fix For: 2.3.0
>
>
> Add 'date_trunc' as a built-in function.
> 'date_trunc' is common in other databases, but neither Spark nor Hive supports it.
> We do have 'trunc', but it only works at the 'MONTH' and 'YEAR' levels on DateType 
> input 
> (https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html#trunc-org.apache.spark.sql.Column-java.lang.String-).
> 'date_trunc' on other databases:
> AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html
> PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html
> Presto: https://prestodb.io/docs/current/functions/datetime.html
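A minimal sketch of the new function, assuming the date_trunc(format, timestamp) argument order of the 2.3.0 implementation (values are illustrative):

{code:scala}
// Truncate a timestamp to the start of its hour.
spark.sql("SELECT date_trunc('HOUR', timestamp '2017-12-20 10:30:45') AS t").show()
// t = 2017-12-20 10:00:00
{code}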






[jira] [Deleted] (SPARK-20747) Distinct in Aggregate Functions

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li deleted SPARK-20747:



> Distinct in Aggregate Functions
> ---
>
> Key: SPARK-20747
> URL: https://issues.apache.org/jira/browse/SPARK-20747
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Xiao Li
>Priority: Major
>
> {noformat}
> AVG ([DISTINCT]|[ALL] )
> MAX ([DISTINCT]|[ALL] )
> MIN ([DISTINCT]|[ALL] )
> SUM ([DISTINCT]|[ALL] )
> {noformat}
> Except for COUNT, the DISTINCT clause is not supported by Spark SQL.






[jira] [Assigned] (SPARK-20746) Built-in SQL Function Improvement

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20746:
---

Assignee: (was: Yuming Wang)

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-20746
> URL: https://issues.apache.org/jira/browse/SPARK-20746
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> SQL functions are part of the core of the ISO/ANSI standards. This umbrella 
> JIRA tracks the ISO/ANSI SQL functions that are not yet fully implemented by 
> Spark SQL, and fixes documentation and test-case issues in the functions that 
> are already supported.






[jira] [Updated] (SPARK-16060) Vectorized ORC reader

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16060:

Summary: Vectorized ORC reader  (was: Vectorized Orc reader)

> Vectorized ORC reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1
>Reporter: Liang-Chi Hsieh
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: release-notes, releasenotes
> Fix For: 2.3.0
>
>
> Currently the ORC reader in Spark SQL does not support vectorized reading. Since 
> Hive's ORC reader already supports vectorization, we should add this support to 
> improve ORC reading performance.
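A minimal sketch of enabling the vectorized path, assuming the 'spark.sql.orc.impl' and 'spark.sql.orc.enableVectorizedReader' settings that ship with this work in 2.3.0, and a placeholder path:

{code:scala}
// Use the native ORC reader with vectorization enabled.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
val df = spark.read.orc("/path/to/orc/table")
df.count()
{code}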






[jira] [Comment Edited] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-06 Thread DENG FEI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355022#comment-16355022
 ] 

DENG FEI edited comment on SPARK-23139 at 2/7/18 6:34 AM:
--

[~irashid] 

You're right, but one can change the default character set via 
_'spark.driver/executor.extraJavaOptions'_ by setting _'-Dfile.encoding=***'_.

This should not be a limitation.


was (Author: deng fei):
[~irashid] 

You're right, but one can change the default character set in _'spark.driver / 
executor.extraJavaOptions'_ by setting _'-Dfile.encoding = ***'._

__This should not be limiting.

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, e.g. in custom exception messages, 
> which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)






[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-06 Thread DENG FEI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355022#comment-16355022
 ] 

DENG FEI commented on SPARK-23139:
--

[~irashid] 

You're right, but one can change the default character set via 
_'spark.driver/executor.extraJavaOptions'_ by setting _'-Dfile.encoding=***'_.

This should not be a limitation.
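A minimal sketch of that suggestion, assuming UTF-8 as a stand-in for the elided '***' value:

{code:scala}
import org.apache.spark.SparkConf

// Force a specific default charset on both the driver and the executors.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dfile.encoding=UTF-8")
  .set("spark.executor.extraJavaOptions", "-Dfile.encoding=UTF-8")
{code}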

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, e.g. in custom exception messages, 
> which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)






[jira] [Assigned] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23345:


Assignee: Apache Spark

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> Seen at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.






[jira] [Assigned] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23345:


Assignee: (was: Apache Spark)

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.






[jira] [Commented] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355005#comment-16355005
 ] 

Apache Spark commented on SPARK-23345:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20524

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.






[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354967#comment-16354967
 ] 

Wenchen Fan commented on SPARK-23309:
-

Is it possible to provide a concrete query (with table schema) that demonstrates 
the performance regression? Looking at the code, I can't find any potential 
places that might contribute to this regression. We need to do some profiling; 
this issue may be caused by something else (e.g. aggregation), not the cache.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 5592 
> partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached")
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
> On Spark 2.2, query times average 13 seconds.
> On the same nodes with Spark 2.3, query times average 17 seconds.
> Note these are timings of queries run after the initial caching, i.e. just running 
> the last line, spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show(), 
> multiple times.
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in cached-query performance between Spark 2.3 and Spark 2.2, 
> with 2.2 better by about 20%.






[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354962#comment-16354962
 ] 

Sean Owen commented on SPARK-23347:
---

Well, that would depend on what Jackson does, and that's buried pretty deep, 
but I'd guess it's the smart thing. The efficiency of serializing 4 bytes just 
can't vary that much though.

As I say, GZipOutputStream is already buffered; why would an extra buffer help?

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream 
> e.g. from KVStoreSerializer :
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.






[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354939#comment-16354939
 ] 

Ted Yu commented on SPARK-23347:


{code}
  public final byte[] serialize(Object o) throws Exception {
{code}
when o is an integer, I assume.

I looked at the ObjectMapper.java source where the generator is involved.
I need to dig some more.

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream 
> e.g. from KVStoreSerializer :
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.






[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354919#comment-16354919
 ] 

Sean Owen commented on SPARK-23347:
---

Yes, but that's the opposite of what this JIRA suggests is a problem. This is 
only a problem if write(byte) is called instead of the bulk write method. Where 
are you suggesting that happens?

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream 
> e.g. from KVStoreSerializer :
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.






[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354907#comment-16354907
 ] 

Ted Yu commented on SPARK-23347:


See JDK 1.8 code:
{code}
class DeflaterOutputStream {
    public void write(int b) throws IOException {
        byte[] buf = new byte[1];
        buf[0] = (byte)(b & 0xff);
        write(buf, 0, 1);
    }

    public void write(byte[] b, int off, int len) throws IOException {
        if (def.finished()) {
            throw new IOException("write beyond end of stream");
        }
        if ((off | len | (off + len) | (b.length - (off + len))) < 0) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return;
        }
        if (!def.finished()) {
            def.setInput(b, off, len);
            while (!def.needsInput()) {
                deflate();
            }
        }
    }
}

class GZIPOutputStream extends DeflaterOutputStream {
    public synchronized void write(byte[] buf, int off, int len)
        throws IOException
    {
        super.write(buf, off, len);
        crc.update(buf, off, len);
    }
}

class Deflater {
    private native int deflateBytes(long addr, byte[] b, int off, int len, int flush);
}

class CRC32 {
    public void update(byte[] b, int off, int len) {
        if (b == null) {
            throw new NullPointerException();
        }
        if (off < 0 || len < 0 || off > b.length - len) {
            throw new ArrayIndexOutOfBoundsException();
        }
        crc = updateBytes(crc, b, off, len);
    }

    private native static int updateBytes(int crc, byte[] b, int off, int len);
}
{code}
For each data byte written through the single-byte write(int b) path, the code above 
has to allocate a one-byte array, acquire several locks, and call two native JNI 
methods (Deflater.deflateBytes and CRC32.updateBytes).

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream 
> e.g. from KVStoreSerializer :
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.






[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354903#comment-16354903
 ] 

Sean Owen commented on SPARK-23347:
---

GZipOutputStream is buffered already. As you say it implements the bulk write 
operation, not the single byte write. That's fine. The opposite is the problem 
for performance. This is especially not a problem in the case the output is 
already also buffered. I think this should be closed as a mistake.

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream 
> e.g. from KVStoreSerializer :
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.






[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2018-02-06 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Description: 
We want to use dynamic allocation to distribute resources among many notebook 
users on our spark clusters. One difficulty is that if a user has cached data 
then we are either prevented from de-allocating any of their executors, or we 
are forced to drop their cached data, which can lead to a bad user experience.

We propose adding a feature to preserve cached data by copying it to other 
executors before de-allocation. This behavior would be enabled by a simple 
spark config. Now when an executor reaches its configured idle timeout, instead 
of just killing it on the spot, we will stop sending it new tasks, replicate 
all of its rdd blocks onto other executors, and then kill it. If there is an 
issue while we replicate the data, like an error, it takes too long, or there 
isn't enough space, then we will fall back to the original behavior and drop 
the data and kill the executor.

This feature should allow anyone with notebook users to use their cluster 
resources more efficiently. Also, since it will be completely opt-in, it is 
unlikely to cause problems for other use cases.

  was:
We want to use dynamic allocation to distribute resources among many notebook 
users on our spark clusters. One difficulty is that if a user has cached data 
then we are either prevented from de-allocating any of their executors, or we 
are forced to drop their cached data, which can lead to a bad user experience.

We propose adding a feature to preserve cached data by copying it to other 
executors before de-allocation. This behavior would be enabled by a simple 
spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
executor reaches its configured idle timeout, instead of just killing it on the 
spot, we will stop sending it new tasks, replicate all of its rdd blocks onto 
other executors, and then kill it. If there is an issue while we replicate the 
data, like an error, it takes too long, or there isn't enough space, then we 
will fall back to the original behavior and drop the data and kill the executor.

This feature should allow anyone with notebook users to use their cluster 
resources more efficiently. Also, since it will be completely opt-in, it is 
unlikely to cause problems for other use cases.



> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
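A minimal sketch of how the opt-in might look, assuming the 'spark.dynamicAllocation.recoverCachedData' name floated in the earlier revision of the description (a proposed setting, not an existing one):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Proposed flag: replicate cached RDD blocks before removing an idle executor.
  .set("spark.dynamicAllocation.recoverCachedData", "true")
{code}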






[jira] [Updated] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-23346:

Description: 
 !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
failure reasons, such as TaskResultLost, but the status is reported as success. In 
the web UI, non-ExceptionFailure failures are counted as successful tasks, which is 
highly misleading.

detail message:
Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
27): TaskResultLost (result lost from block manager)

  was:
We have many other failure reasons, such as TaskResultLost, but the status is 
reported as success. In the web UI, non-ExceptionFailure failures are counted as 
successful tasks, which is highly misleading.

detail message:
Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
27): TaskResultLost (result lost from block manager)


> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Blocker
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
>  !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
> failure reasons, such as TaskResultLost, but the status is reported as success. In 
> the web UI, non-ExceptionFailure failures are counted as successful tasks, which is 
> highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)






[jira] [Updated] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-23346:

Attachment: 企业微信截图_15179715023606.png

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Blocker
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
> We have many other failure reasons, such as TaskResultLost, but the status is 
> reported as success. In the web UI, non-ExceptionFailure failures are counted as 
> successful tasks, which is highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)






[jira] [Updated] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-23346:

Attachment: 企业微信截图_15179714603307.png

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Blocker
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
> We have many other failure reasons, such as TaskResultLost, but the status is 
> reported as success. In the web UI, non-ExceptionFailure failures are counted as 
> successful tasks, which is highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)






[jira] [Created] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-06 Thread JIRA
吴志龙 created SPARK-23346:
---

 Summary: Failed tasks reported as success if the failure reason is 
not ExceptionFailure
 Key: SPARK-23346
 URL: https://issues.apache.org/jira/browse/SPARK-23346
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.0
 Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
Reporter: 吴志龙


We have many other failure reasons, such as TaskResultLost, but the status is 
reported as success. In the web UI, non-ExceptionFailure failures are counted as 
successful tasks, which is highly misleading.

detail message:
Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
27): TaskResultLost (result lost from block manager)






[jira] [Created] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-06 Thread Ted Yu (JIRA)
Ted Yu created SPARK-23347:
--

 Summary: Introduce buffer between Java data stream and gzip stream
 Key: SPARK-23347
 URL: https://issues.apache.org/jira/browse/SPARK-23347
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Ted Yu


Currently GZIPOutputStream is used directly around ByteArrayOutputStream, 
e.g. in KVStoreSerializer:
{code}
  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  GZIPOutputStream out = new GZIPOutputStream(bytes);
{code}
This seems inefficient.
GZIPOutputStream does not implement the write(byte) method. It only provides a 
write(byte[], offset, len) method, which calls the corresponding JNI zlib 
function.

A BufferedOutputStream can be introduced around GZIPOutputStream for better 
performance.
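A minimal sketch of the proposed wrapping (the buffer size is an illustrative assumption):

{code:scala}
import java.io.{BufferedOutputStream, ByteArrayOutputStream}
import java.util.zip.GZIPOutputStream

val bytes = new ByteArrayOutputStream()
// Buffer small writes before they reach the gzip/deflate layer.
val out = new BufferedOutputStream(new GZIPOutputStream(bytes), 8 * 1024)
{code}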






[jira] [Commented] (SPARK-17217) Codegeneration fails for describe() on many columns

2018-02-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354848#comment-16354848
 ] 

Xiao Li commented on SPARK-17217:
-

It should be resolved by https://issues.apache.org/jira/browse/SPARK-22510. If 
not, please re-open it.

> Codegeneration fails for describe() on many columns
> ---
>
> Key: SPARK-17217
> URL: https://issues.apache.org/jira/browse/SPARK-17217
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Kalle Jepsen
>Priority: Major
>
> Consider the following minimal python script:
> {code:python}
> import pyspark
> from pyspark.sql import functions as F
> conf = pyspark.SparkConf()
> sc = pyspark.SparkContext(conf=conf)
> spark = pyspark.sql.SQLContext(sc)
> ncols = 510
> nrows = 10
> df = spark.range(0, nrows)
> s = df.select(
> [
> F.randn(seed=i).alias('C%i' % i) for i in range(ncols)
> ]
> ).describe()
> {code}
> This fails with a traceback counting 3.6M (!) lines for {{ncols >= 510}}, 
> saying something like
> {noformat}
> 16/08/24 16:50:57 ERROR CodeGenerator: failed to compile: java.io.EOFException
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> ...
> /* 7372 */   private boolean isNull_1969;
> /* 7373 */   private double value_1969;
> /* 7374 */   private boolean isNull_1970;
> ...
> /* 11035 */   double value14944 = -1.0;
> /* 11036 */
> /* 11037 */
> /* 11038 */   if (!evalExpr1052IsNull) {
> /* 11039 */
> /* 11040 */ isNull14944 = false; // resultCode could change 
> nullability.
> /* 11041 */ value14944 = evalExpr1326Value - evalExpr1052Value;
> /* 11042 */
> ...
> /* 157621 */ apply1_6(i);
> /* 157622 */ return mutableRow;
> /* 157623 */   }
> /* 157624 */ }
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   ... 30 more
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>   at java.io.DataInputStream.readFully(DataInputStream.java:169)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1383)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:185)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
>   ... 35 more
> {noformat}
> I've seen something similar in an earlier Spark version ([reported in this 
> issue|https://issues.apache.org/jira/browse/SPARK-14138]).
> My conclusion is that {{describe}} was never meant to be used 
> non-interactively on very wide dataframes, am I right?






[jira] [Resolved] (SPARK-17217) Codegeneration fails for describe() on many columns

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17217.
-
Resolution: Duplicate

> Codegeneration fails for describe() on many columns
> ---
>
> Key: SPARK-17217
> URL: https://issues.apache.org/jira/browse/SPARK-17217
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Kalle Jepsen
>Priority: Major
>
> Consider the following minimal python script:
> {code:python}
> import pyspark
> from pyspark.sql import functions as F
> conf = pyspark.SparkConf()
> sc = pyspark.SparkContext(conf=conf)
> spark = pyspark.sql.SQLContext(sc)
> ncols = 510
> nrows = 10
> df = spark.range(0, nrows)
> s = df.select(
> [
> F.randn(seed=i).alias('C%i' % i) for i in range(ncols)
> ]
> ).describe()
> {code}
> This fails with a traceback counting 3.6M (!) lines for {{ncols >= 510}}, 
> saying something like
> {noformat}
> 16/08/24 16:50:57 ERROR CodeGenerator: failed to compile: java.io.EOFException
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> ...
> /* 7372 */   private boolean isNull_1969;
> /* 7373 */   private double value_1969;
> /* 7374 */   private boolean isNull_1970;
> ...
> /* 11035 */   double value14944 = -1.0;
> /* 11036 */
> /* 11037 */
> /* 11038 */   if (!evalExpr1052IsNull) {
> /* 11039 */
> /* 11040 */ isNull14944 = false; // resultCode could change 
> nullability.
> /* 11041 */ value14944 = evalExpr1326Value - evalExpr1052Value;
> /* 11042 */
> ...
> /* 157621 */ apply1_6(i);
> /* 157622 */ return mutableRow;
> /* 157623 */   }
> /* 157624 */ }
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   ... 30 more
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>   at java.io.DataInputStream.readFully(DataInputStream.java:169)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1383)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:185)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
>   ... 35 more
> {noformat}
> I've seen something similar in an earlier Spark version ([reported in this 
> issue|https://issues.apache.org/jira/browse/SPARK-14138]).
> My conclusion is that {{describe}} was never meant to be used 
> non-interactively on very wide dataframes, am I right?






[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18016:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 910825_9.zip
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16845:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
>Priority: Major
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, this fatal error occurs: 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22226) splitExpression can create too many method calls (generating a Constant Pool limit error)

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22226:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> splitExpression can create too many method calls (generating a Constant Pool 
> limit error)
> -
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can happen for several reasons. One of them is that we currently split 
> the definition of the generated methods among several 
> {{NestedClass}} instances, but all these methods are called from the main class. Since 
> an entry is added to the constant pool for each method invocation, this 
> limits the number of rows and, for a very wide dataset, leads to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}
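For illustration only, a sketch of the kind of job that runs into this (assuming a SparkSession named {{spark}}; the column count is arbitrary and merely needs to be large enough to stress the constant pool, and whether it actually fails depends on the Spark version and codegen settings):

{code:scala}
// Builds a projection with thousands of generated expressions; on affected
// versions the generated mutable/unsafe projection class can hit the
// constant pool limit described above.
import org.apache.spark.sql.functions.{col, lit}

val base = spark.range(10).toDF("id")
val manyCols = (1 to 4000).map(i => (col("id") + lit(i)).as(s"c$i"))
base.select(manyCols: _*).collect()
{code}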



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23122) Deprecate register* for UDFs in SQLContext and Catalog in PySpark

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354813#comment-16354813
 ] 

Apache Spark commented on SPARK-23122:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20523

> Deprecate register* for UDFs in SQLContext and Catalog in PySpark
> -
>
> Key: SPARK-23122
> URL: https://issues.apache.org/jira/browse/SPARK-23122
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> Deprecate register* for UDFs in SQLContext and Catalog in PySpark
> It seems we allow many other ways to register UDFs for use in SQL statements: some are 
> in {{SQLContext}} and some are in {{Catalog}}.
> These are also inconsistent with the Java / Scala APIs. It seems better to 
> deprecate them and put the logic into {{UDFRegistration}} 
> ({{spark.udf.register*}}).
> Please see this discussion too - 
> [https://github.com/apache/spark/pull/20217#issuecomment-357134926].
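For reference, the registration path this keeps, {{UDFRegistration}} via {{spark.udf.register}}, looks like this in Scala (a minimal sketch; the UDF name and body are made up for illustration):

{code:scala}
// Register a UDF through spark.udf.register (UDFRegistration) and use it in SQL.
spark.udf.register("plus_one", (x: Long) => x + 1)
spark.sql("SELECT plus_one(id) AS result FROM range(5)").show()
{code}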



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23313) Add a migration guide for ORC

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23313:

Target Version/s: 2.3.0

> Add a migration guide for ORC
> -
>
> Key: SPARK-23313
> URL: https://issues.apache.org/jira/browse/SPARK-23313
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23327) Update the description of three external API or functions

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23327.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Update the description of three external API or functions
> -
>
> Key: SPARK-23327
> URL: https://issues.apache.org/jira/browse/SPARK-23327
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Update the description of three external APIs or functions: `createFunction`, 
> `length`, and `repartitionByRange`.
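For context, the two user-facing items are easy to demonstrate; a minimal sketch on Spark 2.3 (the data is made up, the snippet assumes a spark-shell style session with {{spark}} in scope, and `createFunction` is left out since it is a catalog-level call rather than an end-user API):

{code:scala}
// Illustrative usage of the `length` SQL function and Dataset.repartitionByRange
// (the latter was added in Spark 2.3).
import org.apache.spark.sql.functions.{col, length}
import spark.implicits._

val df = Seq("a", "bb", "ccc").toDF("s")
df.select(length(col("s")).as("len")).show()
df.repartitionByRange(2, col("s")).rdd.getNumPartitions  // partitions after range partitioning
{code}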



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22158) convertMetastore should not ignore storage properties

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22158:

Summary: convertMetastore should not ignore storage properties  (was: 
convertMetastore should not ignore table properties)

> convertMetastore should not ignore storage properties
> -
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc has ignored table properties and used an 
> empty map instead. The same applies to convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197
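Until this is fixed, one knob worth knowing about is the conversion flag itself: turning it off routes reads through the Hive serde path instead of the conversion path that drops the properties. A hedged workaround sketch, assuming a Hive-enabled SparkSession named {{spark}} and a hypothetical table name (whether this trade-off is acceptable depends on the workload, since the serde path gives up the data source reader optimizations):

{code:scala}
// Possible workaround sketch (not the fix in this ticket): disable the
// metastore conversion so Parquet/ORC Hive tables are read via the Hive serde
// path rather than the conversion path that ignores the storage properties.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.table("some_db.some_hive_table").show()  // table name is hypothetical
{code}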



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354784#comment-16354784
 ] 

Steve Loughran edited comment on SPARK-23308 at 2/7/18 12:26 AM:
-

bq. I have not heard this come up before as an issue in another implementation.

S3A's input stream handles an IOE other than EOF with a: increment metrics, 
close the stream, retry once; generally that causes the error to be recovered 
from. If not, you are into the unrecoverable-network-problems kind of problem, 
except for the special case of "you are recycling the pool of HTTP connections 
and should abort that TCP connection before trying anything else". I think 
there are opportunities to improve S3A there by aborting the connection before 
retrying.

I don't think Spark is in the position to  be clever about retries, as it's too 
low-level as to what is retryable vs not; it would need a policy for all 
possible exceptions from all known FS clients and split them into "we can 
recover" from "no, fail fast"

Trying to come up with a good policy is (a) something the FS clients should be 
doing and (b) really hard to get right in the absence of frequent failures; it 
is usually evolution based on bug reports. For example 
[S3ARetryPolicy|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java#L87]
 is very much a WiP (HADOOP-14531).

Marcio: surprised you are getting so many socket timeouts. If this is happening 
in EC2 it's *potentially* throttling related; overloaded connection pools raise 
ConnectionPoolTimeoutException, apparently.


was (Author: ste...@apache.org):
bq. I have not heard this come up before as an issue in another implementation.

S3A's input stream handles an IOE other than EOF with a: increment metrics, 
close the stream, retry once; generally that causes the error to be recovered 
from. If not, you are into the unrecoverable-network-problems kind of problem, 
except for the special case of "you are recycling the pool of HTTP connections 
and should abort that TCP connection before trying anything else". I think 
there are opportunities to improve S3A there by aborting the connection before 
retrying.

I don't think Spark is in the position to  be clever about retries, as its too 
low-level as to what is retryable vs not; it would need a policy for all 
possible exceptions from all known FS clients and split them into "we can 
recover" from "no, fail fast"

Trying to come up with a good policy is (a) something the FS clients should be 
doing and (b) really hard to get right in the absence of frequent failures; its 
usually evolution based on bug reports. For example 
[S3ARetryPolicy|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java#L87]
 is very much a WiP (HADOOP-14531).

Marcio: surprised you are getting so many socket timeouts. If this is happening 
in EC2 it's *potentially* throttling related; overloaded connection pools raise 
ConnectionPoolTimeoutException, apparently.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354784#comment-16354784
 ] 

Steve Loughran commented on SPARK-23308:


bq. I have not heard this come up before as an issue in another implementation.

S3A's input stream handles an IOE other than EOF with a: increment metrics, 
close the stream, retry once; generally that causes the error to be recovered 
from. If not, you are into the unrecoverable-network-problems kind of problem, 
except for the special case of "you are recycling the pool of HTTP connections 
and should abort that TCP connection before trying anything else". I think 
there are opportunities to improve S3A there by aborting the connection before 
retrying.

I don't think Spark is in the position to be clever about retries, as it's too 
low-level as to what is retryable vs not; it would need a policy for all 
possible exceptions from all known FS clients and split them into "we can 
recover" from "no, fail fast"

Trying to come up with a good policy is (a) something the FS clients should be 
doing and (b) really hard to get right in the absence of frequent failures; it's 
usually evolution based on bug reports. For example 
[S3ARetryPolicy|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java#L87]
 is very much a WiP (HADOOP-14531).

Marcio: surprised you are getting so many socket timeouts. If this is happening 
in EC2 it's *potentially* throttling related; overloaded connection pools raise 
ConnectionPoolTimeoutException, apparently.
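To make the shape of that behaviour concrete, here is a purely illustrative caller-side sketch of "retry once on a retryable IOE, rethrow everything else"; it is not S3A's code, and treating only SocketTimeoutException as retryable is an assumption made for the example:

{code:scala}
// Illustration only: one retry on a socket timeout, everything else (including
// EOFException) is rethrown. Real retry policies belong in the FS client.
import java.net.SocketTimeoutException

def withOneRetry[T](read: () => T): T =
  try read() catch {
    case _: SocketTimeoutException => read()
  }

// usage: val n = withOneRetry(() => stream.read(buffer))
{code}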

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-06 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354680#comment-16354680
 ] 

Imran Rashid commented on SPARK-19870:
--

[~eyalfa] my recollection is a bit rusty, but your explanation sounds 
reasonable.  Pretty sure that if SPARK-22083 is the cause, you'll see some 
other exception on the executor before this (perhaps it even appears to be 
handled gracefully at that point), so you should take a look at the executor logs.  
If there is no exception, it's probably not SPARK-22083.  Still, executor logs would 
be really helpful to figure out what it might be.  As you mentioned earlier, if 
this is reproducible, then it's worth even turning on TRACE and just grabbing 
those logs.

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: stack.txt
>
>
> I am running what I believe to be a fairly vanilla Spark job, using the RDD API, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to Parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at 

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-06 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354665#comment-16354665
 ] 

Li Jin commented on SPARK-23314:


I figured out what the issue is. Will have a patch soon.

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20659) Remove StorageStatus, or make it private.

2018-02-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354628#comment-16354628
 ] 

Marcelo Vanzin commented on SPARK-20659:


I haven't really looked in detail at this stuff, so I don't really have many 
answers, but...

bq. In the new method of SparkStatusTracker is it fine just to give back an 
array of RDDStorageInfo from the store?

That type does not seem to be tracking the same thing as {{StorageStatus}}. 
There doesn't seem to be an immediate equivalent type in the REST API. So I'd 
keep it simple - expose a few extra things like memory metrics in 
{{SparkExecutorInfo}}, for example, and wait until someone asks for more 
information about individual RDDs to take action.

[~joshrosen] added the original {{getExecutorStorageStatus}} method so he might 
have more background on what it was trying to expose.

bq. And for the metrics to pass BlockManagerInfo array via the 
BlockManagerMasterEndpoint

Not sure what the question is. But the same data currently needed by that code 
path should be kept, maybe just using a different type (or a simplified, 
private version of {{StorageStatus}}). You have to find the callers of that RPC 
and what they need to figure out what needs to happen here.
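For orientation, the public surface already reachable from user code looks roughly like this (a minimal sketch assuming a live SparkContext named {{sc}}; the extra memory metrics discussed above are not part of {{SparkExecutorInfo}} at this point):

{code:scala}
// SparkStatusTracker is the supported way to poll executor info from user code,
// as opposed to the StorageStatus objects this ticket wants to retire.
val tracker = sc.statusTracker
tracker.getExecutorInfos.foreach { info =>
  println(s"${info.host}:${info.port} cacheSize=${info.cacheSize} running=${info.numRunningTasks}")
}
{code}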

> Remove StorageStatus, or make it private.
> -
>
> Key: SPARK-20659
> URL: https://issues.apache.org/jira/browse/SPARK-20659
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the work being done in SPARK-18085, StorageStatus is not used anymore by 
> the UI. It's still used in a couple of other places, though:
> - {{SparkContext.getExecutorStorageStatus}}
> - {{BlockManagerSource}} (a metrics source)
> Both could be changed to use the REST API types; the first one could be 
> replaced with a new method in {{SparkStatusTracker}}, which I also think is a 
> better place for it anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2018-02-06 Thread Clay Stevens (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354627#comment-16354627
 ] 

Clay Stevens commented on SPARK-10925:
--

I was having the same problem, and using `df.rdd.toDF()` before the `join` 
worked: I was then able to join an aggregated dataframe back to the 
dataframe it was derived from without this error.  Great.  I had to do this 
iteratively for many column pairs, then join all of the results together.  
This final set of joins would happen without error.

However, when I tried any action on the final dataframe, Yarn would kill all of 
the containers for exceeding memory limits. We tried various configuration 
settings, changing `spark.dynamicAllocation.minExecutors`, `spark.executor.cores`, 
`spark.executor.memory`, `spark.yarn.executor.memoryOverhead`, and 
`spark.driver.maxResultSize`. None helped.  Our software engineers working with 
Cloudera determined the cause to be SPARK-21033 and SPARK-22438 in Spark 2.2.0.

Then I tried the same "`df.rdd.toDF()` before using `join`" trick and the 
memory error went away.  Are these issues related, or is the same error getting 
thrown many times in the background once an action is completed on a dataframe 
that has many joins, and I only see the result as a "Container killed by YARN 
for exceeding memory limits." error message?
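For the record, the Scala equivalent of the `df.rdd.toDF()` trick mentioned above is a round-trip through the RDD API: rebuilding the DataFrame from its RDD and schema detaches it from the original logical plan, so the later self-join no longer trips over conflicting attribute ids. A minimal sketch with hypothetical names:

{code:scala}
// Rebuild the aggregated side from its RDD + schema before re-joining it to the
// original DataFrame; this breaks the shared lineage between the two sides.
val aggregated = df.groupBy("name").count()
val detached   = spark.createDataFrame(aggregated.rdd, aggregated.schema)
val rejoined   = df.join(detached, "name")
{code}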

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Commented] (SPARK-22158) convertMetastore should not ignore table properties

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354525#comment-16354525
 ] 

Apache Spark commented on SPARK-22158:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20522

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc has ignored table properties and used an 
> empty map instead. The same applies to convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23315) failed to get output from canonicalized data source v2 related plans

2018-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23315.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> failed to get output from canonicalized data source v2 related plans
> 
>
> Key: SPARK-23315
> URL: https://issues.apache.org/jira/browse/SPARK-23315
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2018-02-06 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354440#comment-16354440
 ] 

Tejas Patil commented on SPARK-19256:
-

[~ferdonline]: The feature you are referring to is not in the scope of this 
Jira, as storing bucketing metadata outside of the metastore is not something that 
Hive does. Could you please write to dev@ and create a Jira to initiate this 
discussion?

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20659) Remove StorageStatus, or make it private.

2018-02-06 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354430#comment-16354430
 ] 

Attila Zsolt Piros commented on SPARK-20659:


[~vanzin] I would like to ask a few questions:
- In the new method of SparkStatusTracker, is it fine just to give back an array 
of RDDStorageInfo from the store?
- And for the metrics, is it fine to pass the BlockManagerInfo array via the 
BlockManagerMasterEndpoint and use only the maxMem, maxOnHeapMem, 
maxOffHeapMem counters?


> Remove StorageStatus, or make it private.
> -
>
> Key: SPARK-20659
> URL: https://issues.apache.org/jira/browse/SPARK-20659
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the work being done in SPARK-18085, StorageStatus is not used anymore by 
> the UI. It's still used in a couple of other places, though:
> - {{SparkContext.getExecutorStorageStatus}}
> - {{BlockManagerSource}} (a metrics source)
> Both could be changed to use the REST API types; the first one could be 
> replaced with a new method in {{SparkStatusTracker}}, which I also think is a 
> better place for it anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2018-02-06 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354429#comment-16354429
 ] 

Tejas Patil commented on SPARK-18067:
-

[~eyalfa] : 

Re #1: Theoretically this idea should help reduce shuffle for 
aggregations as well. I have not looked at how many changes that would need 
implementation-wise. I think it would be good to have, but I would prefer 
to keep it separate from the current PR.

Re #2: This is interesting. Let me think more about this.

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354384#comment-16354384
 ] 

Dilip Biswal edited comment on SPARK-23271 at 2/6/18 7:23 PM:
--

Thank you [~smilegator]. I will try to create a PR to fix this by 
repartitioning the RDD before setting up the write job. 
We can discuss whether it's the right approach to fix this issue in the PR.


was (Author: dkbiswal):
Thank you [~smilegator]. I will try to create a PR to fix this by trying to 
repartition the RDD before setting up the write job. We can discuss whether its 
the right approach in the PR.

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> Sophisticated case, reproduced only if read empty CSV file without header 
> with assigned schema.
> Steps for reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = 
> false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema 
> for Parquet. It must be specified manually.;
> val actualDF = spark.read.parquet(outputFolderName)
>  
> {code}
> *Actual:* Only "_SUCCESS" file in output directory
> *Expected*: at least one Parquet file with schema.
> Project for reproduce is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354384#comment-16354384
 ] 

Dilip Biswal commented on SPARK-23271:
--

Thank you [~smilegator]. I will try to create a PR to fix this by 
repartitioning the RDD before setting up the write job. We can discuss whether it's 
the right approach in the PR.
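Until a fix is in, a user-side approximation of the same idea is to force at least one write task (an untested sketch, reusing the names from the reproduction quoted below; whether an empty partition produces a schema-only file can vary by version):

{code:scala}
// Workaround sketch in the spirit of the proposed fix: repartition so the write
// job runs at least one task even when the input DataFrame is empty.
val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
inputDF.repartition(1).write.parquet(outputFolderName)
{code}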

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> Sophisticated case, reproduced only if read empty CSV file without header 
> with assigned schema.
> Steps for reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = 
> false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema 
> for Parquet. It must be specified manually.;
> val actualDF = spark.read.parquet(outputFolderName)
>  
> {code}
> *Actual:* Only "_SUCCESS" file in output directory
> *Expected*: at least one Parquet file with schema.
> Project for reproduce is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2

2018-02-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354336#comment-16354336
 ] 

Thomas Graves commented on SPARK-23309:
---

I pulled in that patch ([https://github.com/apache/spark/pull/20513]) and 
the numbers got better, but I am still seeing 2.3 about 10% slower (down from 
15%).

This is using the configs: --conf spark.sql.orc.impl=hive --conf 
spark.sql.orc.filterPushdown=true --conf 
spark.sql.hive.convertMetastoreOrc=false --conf 
spark.sql.inMemoryColumnarStorage.enableVectorizedReader=false

Has anyone else reproduced this, or is it only me seeing it?
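For anyone trying to reproduce, the same flags can be set programmatically when building the session; a minimal sketch mirroring the configs above (the app name is arbitrary, and enableHiveSupport() is assumed because the test reads Hive ORC tables):

{code:scala}
// The configs listed above, set on a SparkSession builder instead of --conf flags.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cached-query-comparison")
  .config("spark.sql.orc.impl", "hive")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "false")
  .enableHiveSupport()
  .getOrCreate()
{code}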

> Spark 2.3 cached query performance 20-30% worse then spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql 
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2 I see query times averaging 13 seconds.
> On the same nodes I see Spark 2.3 query times averaging 17 seconds.
> Note these are times of queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between spark 
> 2.3 and spark 2.2, where 2.2 was better by like 20%.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23345:
---
Component/s: Tests

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> See at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-06 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23345:
--

 Summary: Flaky test: FileBasedDataSourceSuite
 Key: SPARK-23345
 URL: https://issues.apache.org/jira/browse/SPARK-23345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


See at least once:
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

{noformat}
Error Message
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 650 times over 10.011358752 
seconds. Last failure message: There are 1 possibly leaked file streams..

Stacktrace
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 650 times over 10.011358752 
seconds. Last failure message: There are 1 possibly leaked file streams..
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
at 
org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
at 
org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
{noformat}

Log file is too large to attach here, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márcio Furlani Carmona updated SPARK-23308:
---
Comment: was deleted

(was: Yeah, I set it back to `ignoreCorruptFiles=false` to prevent this. But 
then if there's indeed a corrupt file, our job will never succeed until we fix 
that.

The biggest problem for me was that silent failure you mentioned. I just found 
out there was something wrong after running a job for the same input multiple 
times and noticing some missing data, then I started investigating the reason 
why and figured out it was due to this flag and the SocketTimeoutException i 
mentioned.

I agree the documentation should at least mention the risks of setting this 
flags and for which exceptions it considers the data as corrupt. Right now I 
believe this flag is not even documented officially, is it? 
https://spark.apache.org/docs/latest/configuration.html)

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354250#comment-16354250
 ] 

Márcio Furlani Carmona commented on SPARK-23308:


Yeah, I set it back to `ignoreCorruptFiles=false` to prevent this. But then if 
there's indeed a corrupt file, our job will never succeed until we fix that.

The biggest problem for me was that silent failure you mentioned. I just found 
out there was something wrong after running a job for the same input multiple 
times and noticing some missing data, then I started investigating the reason 
why and figured out it was due to this flag and the SocketTimeoutException I 
mentioned.

I agree the documentation should at least mention the risks of setting this 
flag and for which exceptions it considers the data corrupt. Right now I 
believe this flag is not even documented officially, is it? 
https://spark.apache.org/docs/latest/configuration.html

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354248#comment-16354248
 ] 

Márcio Furlani Carmona commented on SPARK-23308:


Yeah, I set it back to `ignoreCorruptFiles=false` to prevent this. But then if 
there's indeed a corrupt file, our job will never succeed until we fix that.

The biggest problem for me was that silent failure you mentioned. I just found 
out there was something wrong after running a job for the same input multiple 
times and noticing some missing data, then I started investigating the reason 
why and figured out it was due to this flag and the SocketTimeoutException I 
mentioned.

I agree the documentation should at least mention the risks of setting this 
flag and for which exceptions it considers the data corrupt. Right now I 
believe this flag is not even documented officially, is it? 
https://spark.apache.org/docs/latest/configuration.html

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22977) DataFrameWriter operations do not show details in SQL tab

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22977:


Assignee: Apache Spark

> DataFrameWriter operations do not show details in SQL tab
> -
>
> Key: SPARK-22977
> URL: https://issues.apache.org/jira/browse/SPARK-22977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: after.png, before.png
>
>
> When running CreateHiveTableAsSelectCommand or InsertIntoHiveTable, the SQL 
> tab doesn't show details after 
> [SPARK-20213|https://issues.apache.org/jira/browse/SPARK-20213].
> *Before*:
> !before.png!
> *After*:
> !after.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22977) DataFrameWriter operations do not show details in SQL tab

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354208#comment-16354208
 ] 

Apache Spark commented on SPARK-22977:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20521

> DataFrameWriter operations do not show details in SQL tab
> -
>
> Key: SPARK-22977
> URL: https://issues.apache.org/jira/browse/SPARK-22977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: after.png, before.png
>
>
> When running CreateHiveTableAsSelectCommand or InsertIntoHiveTable, the SQL 
> tab doesn't show details after 
> [SPARK-20213|https://issues.apache.org/jira/browse/SPARK-20213].
> *Before*:
> !before.png!
> *After*:
> !after.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22977) DataFrameWriter operations do not show details in SQL tab

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22977:


Assignee: (was: Apache Spark)

> DataFrameWriter operations do not show details in SQL tab
> -
>
> Key: SPARK-22977
> URL: https://issues.apache.org/jira/browse/SPARK-22977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: after.png, before.png
>
>
> When running CreateHiveTableAsSelectCommand or InsertIntoHiveTable, the SQL 
> tab doesn't show details after 
> [SPARK-20213|https://issues.apache.org/jira/browse/SPARK-20213].
> *Before*:
> !before.png!
> *After*:
> !after.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18356) KMeans should cache RDD before training

2018-02-06 Thread Lovasoa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354196#comment-16354196
 ] 

Lovasoa commented on SPARK-18356:
-

Will this also be fixed for BisectingKMeans?

> KMeans should cache RDD before training
> ---
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1
>Reporter: zakaria hili
>Assignee: zakaria hili
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.2.0
>
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect 
> Spark KMeans performance.
> Before starting to explain the problem, I want to explain the warning that I 
> faced.
> I tried to use Spark Kmeans with Dataframes to cluster my data
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k = k + 1
> but when I run the code I receive the warning :
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and 
> realized there are two classes responsible for this warning: 
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>  
> When my dataframe is cached, the fit method transforms my dataframe into an 
> internal RDD which is not cached.
> Dataframe -> rdd -> run Training Kmeans Algo(rdd)
> -> The first class (ml package) is responsible for converting the dataframe 
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here 
> Spark verifies whether the RDD is cached; if not, a warning is generated.
> So, the solution to this problem is to cache the RDD before running the 
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the dataframe transformation, then unpersist it 
> after the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI
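
For illustration, a minimal sketch of the caching pattern described above, shown 
at the user level with the mllib API (the data path and the k value are 
placeholders; in the reported case the uncached RDD is created internally by 
ml.KMeans.fit, so the real fix belongs there):

{code:scala}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val vectors = spark.sparkContext
  .textFile("data/points.csv")                                  // placeholder input
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

vectors.persist(StorageLevel.MEMORY_AND_DISK)   // cache before the iterative training
val model = new KMeans().setK(3).run(vectors)   // avoids the "not directly cached" warning
vectors.unpersist()                             // release after training
{code}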



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20659) Remove StorageStatus, or make it private.

2018-02-06 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354191#comment-16354191
 ] 

Attila Zsolt Piros commented on SPARK-20659:


I would like to take this issue.

> Remove StorageStatus, or make it private.
> -
>
> Key: SPARK-20659
> URL: https://issues.apache.org/jira/browse/SPARK-20659
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the work being done in SPARK-18085, StorageStatus is not used anymore by 
> the UI. It's still used in a couple of other places, though:
> - {{SparkContext.getExecutorStorageStatus}}
> - {{BlockManagerSource}} (a metrics source)
> Both could be changed to use the REST API types; the first one could be 
> replaced with a new method in {{SparkStatusTracker}}, which I also think is a 
> better place for it anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354178#comment-16354178
 ] 

Sean Owen commented on SPARK-23329:
---

Sure, though I might simplify that a bit.

{code}
/**
 * @param e angle in radians
 * @return sine of the angle, as if computed by [[java.lang.Math.sin]]
 * ...
 */
{code}

Perhaps this can be limited to improving the documentation of all trigonometric 
functions.

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-06 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354166#comment-16354166
 ] 

Mihaly Toth commented on SPARK-23329:
-

How about this approach?

{code:scala}
  /**
   * Computes the sine of the given column. Works same as [[java.lang.Math.sin]]
   *
   * @param  e Column of angles, in radians.
   * @return new Column comprising sine value of each `e` element.
   *
   * @group math_funcs
   * @since 1.4.0
   */
  def sin(e: Column): Column = withExpr { Sin(e.expr) }
{code}

I am in a bit of trouble with the wording. The original doc stated {{Computes 
the sine of the given}} *{{value}}*, which is not really the case. Even 
_calculating the sine of a Column_ is not 100% precise, but I guess it is not 
misleading given the context and is condensed enough on the other hand.

The unit of measurement can possibly be moved to the param description, I believe.

Another question: the majority of the javadocs in {{functions.scala}} are 
lacking return value and parameter descriptions. Does this Jira aim to fix all 
of them (there are 334 ' def ' expressions in the file), or just the 
math_funcs group, or only the trigonometric ones, as the title suggests?

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-06 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354045#comment-16354045
 ] 

Li Jin commented on SPARK-23314:


I think this is related to how Pandas deals with timestamp localization. I will 
spend some more time today.

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-06 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353998#comment-16353998
 ] 

Xiao Li commented on SPARK-23271:
-

After a discussion with [~cloud_fan], we think the behavior should be 
consistent regardless of the number of RDD partitions. Thanks!
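
For illustration, a sketch of the inconsistency (paths are placeholders; the 
repartition(1) behaviour is an assumption based on this discussion, not a 
verified workaround):

{code:scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
val emptyDF = spark.read.schema(anySchema).csv("in/empty-no-header")   // empty CSV, no header

// With zero RDD partitions, only _SUCCESS is written (the reported behaviour).
emptyDF.write.parquet("out/zero-partitions")

// With at least one partition, a schema-only Parquet file should be written,
// so the output directory can be read back.
emptyDF.repartition(1).write.parquet("out/one-partition")
{code}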

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> A sophisticated case, reproduced only when reading an empty CSV file without 
> a header using an assigned schema.
> Steps to reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = 
> false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema 
> for Parquet. It must be specified manually.;
> val actualDF = spark.read.parquet(outputFolderName)
>  
> {code}
> *Actual:* Only "_SUCCESS" file in output directory
> *Expected*: at least one Parquet file with schema.
> A project to reproduce the issue is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23344:
--
Priority: Minor  (was: Major)

(Aside: I'm not sure it's so constructive to add an API and only later add the 
other language bindings. But lots of people seem to do it this way here.)
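
For context, a minimal sketch of the parameter in question on the Scala/ML side 
(assuming Spark 2.4, where SPARK-22119 landed); PySpark currently exposes no 
equivalent setter:

{code:scala}
import org.apache.spark.ml.clustering.KMeans

// "euclidean" is the default; "cosine" was added by SPARK-22119.
val kmeans = new KMeans()
  .setK(3)
  .setDistanceMeasure("cosine")
{code}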

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23344:


Assignee: Apache Spark

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23344:


Assignee: (was: Apache Spark)

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353992#comment-16353992
 ] 

Apache Spark commented on SPARK-23344:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20520

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-06 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23344:
---

 Summary: Add KMeans distanceMeasure param to PySpark
 Key: SPARK-23344
 URL: https://issues.apache.org/jira/browse/SPARK-23344
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Marco Gaido


SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. We 
should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-06 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353953#comment-16353953
 ] 

Julien Cuquemelle commented on SPARK-22683:
---

[~tgraves]: the issue appears at each stage boundary, as for each beginning 
stage we might request too large a number of executors. 
Regarding the end of a single-stage job, the over-allocated executors usually 
still consume resources even if they are released before the (30s) timeout at 
the end of the job, because even a single executor running a while longer than 
the others will keep the unused executors alive until the end of the job.

Releasing executors faster would probably reduce the overhead, but in that case 
it would add overhead for multistage jobs that would have to re-spawn more 
executors at each stage. 

The metric I don't know is how long a container lives if we request it and 
release it immediately once it's available.

The idea about something similar to speculative execution seems interesting, 
but in that case we will still have a latency impact, as we will need to wait 
for a significant number of tasks to be finished to compute running time stats 
and decide whether we need more executors or not. If the tasks are very short, 
the metric is quickly available but then we don't request more executors, so we 
get a similar effect to using 2 or more tasks per taskSlot. If the tasks are 
long, we will need to wait for a first batch of tasks and then compute the 
correct number of executors to process the rest of the tasks. In that case the 
latency would not be better than roughly twice the time to run a single task, 
which is exactly what we get with 2 tasks per taskSlot.

Another view would be to start upfront with n>1 tasks per taskSlot, and gather 
stats about the time needed to spawn the required executors, then if the 
running tasks are significantly longer than executor allocation time, request 
new executors (and in that case we don't need to wait until completion of the 
tasks).
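
For readers less familiar with the knobs under discussion, a hedged sketch of the 
existing dynamic allocation settings used in the experiments quoted below (no 
"tasks per taskSlot" setting exists today):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")               // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")  // the 30s idle timeout mentioned below
  .set("spark.executor.cores", "5")                           // the 5-core executors used in the experiments
{code}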

 

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change 

[jira] [Commented] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353945#comment-16353945
 ] 

Apache Spark commented on SPARK-23240:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/20519

> PythonWorkerFactory issues unhelpful message when pyspark.daemon produces 
> bogus stdout
> --
>
> Key: SPARK-23240
> URL: https://issues.apache.org/jira/browse/SPARK-23240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Bruce Robbins
>Priority: Minor
>
> Environmental issues or site-local customizations (i.e., sitecustomize.py 
> present in the python install directory) can interfere with daemon.py’s 
> output to stdout. PythonWorkerFactory produces unhelpful messages when this 
> happens, causing some head scratching before the actual issue is determined.
> Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, 
> PythonWorkerFactory uses the output as the daemon’s port number and ends up 
> throwing an exception when creating the socket:
> {noformat}
> java.lang.IllegalArgumentException: port out of range:1819239265
>   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>   at java.net.InetSocketAddress.(InetSocketAddress.java:188)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78)
> {noformat}
> Case #2: No data in pyspark.daemon’s stdout. In this case, 
> PythonWorkerFactory throws an EOFException while reading from the 
> Process input stream.
> The second case is somewhat less mysterious than the first, because 
> PythonWorkerFactory also displays the stderr from the python process.
> When there is unexpected or missing output in pyspark.daemon’s stdout, 
> PythonWorkerFactory should say so.
>  
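
A hypothetical sketch of the kind of check the ticket asks for (names and 
message wording are assumptions, not the actual patch):

{code:scala}
import java.io.{DataInputStream, InputStream}
import org.apache.spark.SparkException

// Validate what pyspark.daemon wrote to stdout before treating it as a port.
def readDaemonPort(daemonStdout: InputStream): Int = {
  val port = new DataInputStream(daemonStdout).readInt()
  if (port < 1 || port > 65535) {
    throw new SparkException(
      s"Bad data in pyspark.daemon's standard output: expected a port number, got $port. " +
        "Check for environment issues such as a sitecustomize.py writing to stdout.")
  }
  port
}
{code}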



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22119) Add cosine distance to KMeans

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353922#comment-16353922
 ] 

Apache Spark commented on SPARK-22119:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20518

> Add cosine distance to KMeans
> -
>
> Key: SPARK-22119
> URL: https://issues.apache.org/jira/browse/SPARK-22119
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, KMeans assumes the only possible distance measure to be used is 
> the Euclidean.
> In some use cases, eg. text mining, other distance measures like the cosine 
> distance are widely used. Thus, for such use cases, it would be good to 
> support multiple distance measures.
> This ticket is to support the cosine distance measure on KMeans. Later, other 
> algorithms can be extended to support several distance measures and other 
> distance measures can be added.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager

2018-02-06 Thread Rob Keevil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353870#comment-16353870
 ] 

Rob Keevil commented on SPARK-23257:


[~ifilonenko] Happy to help, let me know when you are ready for me to look

> Implement Kerberos Support in Kubernetes resource manager
> -
>
> Key: SPARK-23257
> URL: https://issues.apache.org/jira/browse/SPARK-23257
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Rob Keevil
>Priority: Major
>
> On the forked k8s branch of Spark at 
> [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support 
> has been added to the Kubernetes resource manager.  The Kubernetes code 
> between these two repositories appears to have diverged, so this commit 
> cannot be merged in easily.  Are there any plans to re-implement this work on 
> the main Spark repository?
>  
> [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help 
> with the development and testing of this, but I wanted to confirm that this 
> isn't already in progress -  I could not find any discussion about this 
> specific topic online.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23328) Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary

2018-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23328:
-
Target Version/s:   (was: 2.3.0)

> Disallow default value None in na.replace/replace when 'to_replace' is not a 
> dictionary
> ---
>
> Key: SPARK-23328
> URL: https://issues.apache.org/jira/browse/SPARK-23328
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We happened to set {{None}} as a default value via SPARK-19454, which is 
> quite weird.
> It looks like we had better only do this when the input is a dictionary.
> Please see https://github.com/apache/spark/pull/16793#issuecomment-362684399



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10063) Remove DirectParquetOutputCommitter

2018-02-06 Thread Henrique dos Santos Goulart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart updated SPARK-10063:

Comment: was deleted

(was: Is there any alternative right now that works with Parquet and uses 
partitionBy? Because it works very well if I set version=2 and do not use 
partitionBy with parquet, but if I use 
dataset.partitionBy(..).option(..algorithm.version", "2").parquet(...) it will 
create temporary folders =(

Reference question: 
[https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder]

 

[~rxin] [~yhuai] [~ste...@apache.org] [~chiragvaya])

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3 and this 
> file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23336) Upgrade snappy-java to 1.1.7.1

2018-02-06 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23336:

Summary: Upgrade snappy-java to 1.1.7.1  (was: Upgrade snappy-java to 1.1.4)

> Upgrade snappy-java to 1.1.7.1
> --
>
> Key: SPARK-23336
> URL: https://issues.apache.org/jira/browse/SPARK-23336
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Minor
>
> We should upgrade the snappy-java version to improve compression (5%) and 
> decompression (20%) performance.
> Details:
>  
> [https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-114-2017-05-22]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-06 Thread huangtengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353761#comment-16353761
 ] 

huangtengfei commented on SPARK-23053:
--

here is the stack trace of exception.

java.lang.ClassCastException: org.apache.spark.rdd.CheckpointRDDPartition 
cannot be cast to org.apache.spark.streaming.rdd.MapWithStateRDDPartition
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:152)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Priority: Major
>
> When we run concurrent jobs using the same RDD which is marked for 
> checkpointing, one job may finish and start the RDD.doCheckpoint process 
> while another job is being submitted, in which case submitStage and 
> submitMissingTasks will be called. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, both 
> of which are affected by the checkpoint status. If the former is calculated 
> before doCheckpoint finishes while the latter is calculated after doCheckpoint 
> finishes, then when the task runs and rdd.compute is called, RDDs with a 
> particular partition type such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, will get a ClassCastException because the 
> part param is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that 
> called sc.runJob, while the task serialization occurs in the DAGScheduler's 
> event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-06 Thread huangtengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353761#comment-16353761
 ] 

huangtengfei edited comment on SPARK-23053 at 2/6/18 11:48 AM:
---

here is the stack trace of exception.

{code:java}
java.lang.ClassCastException: org.apache.spark.rdd.CheckpointRDDPartition 
cannot be cast to org.apache.spark.streaming.rdd.MapWithStateRDDPartition
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:152)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
{code}



was (Author: ivoson):
here is the stack trace of exception.

java.lang.ClassCastException: org.apache.spark.rdd.CheckpointRDDPartition 
cannot be cast to org.apache.spark.streaming.rdd.MapWithStateRDDPartition
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:152)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Priority: Major
>
> When we run concurrent jobs using the same RDD which is marked for 
> checkpointing, one job may finish and start the RDD.doCheckpoint process 
> while another job is being submitted, in which case submitStage and 
> submitMissingTasks will be called. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, both 
> of which are affected by the checkpoint status. If the former is calculated 
> before doCheckpoint finishes while the latter is calculated after doCheckpoint 
> finishes, then when the task runs and rdd.compute is called, RDDs with a 
> particular partition type such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, will get a ClassCastException because the 
> part param is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that 
> called sc.runJob, while the task serialization occurs in the DAGScheduler's 
> event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-06 Thread huangtengfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtengfei updated SPARK-23053:
-
Description: 
When we run concurrent jobs using the same RDD which is marked for 
checkpointing, one job may finish and start the RDD.doCheckpoint process while 
another job is being submitted, in which case submitStage and 
submitMissingTasks will be called. In 
[submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
 taskBinaryBytes is serialized and the task partitions are calculated, both of 
which are affected by the checkpoint status. If the former is calculated before 
doCheckpoint finishes while the latter is calculated after doCheckpoint 
finishes, then when the task runs and rdd.compute is called, RDDs with a 
particular partition type such as 
[MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
 which cast the partition type, will get a ClassCastException because the part 
param is actually a CheckpointRDDPartition.
This error occurs because rdd.doCheckpoint runs in the same thread that called 
sc.runJob, while the task serialization occurs in the DAGScheduler's event 
loop.

  was:When we run concurrent jobs using the same RDD which is marked for 
checkpointing, one job may finish and start the RDD.doCheckpoint process while 
another job is being submitted, in which case submitStage and 
submitMissingTasks will be called. In 
[submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
 taskBinaryBytes is serialized and the task partitions are calculated, both of 
which are affected by the checkpoint status. If the former is calculated before 
doCheckpoint finishes while the latter is calculated after doCheckpoint 
finishes, then when the task runs and rdd.compute is called, RDDs with a 
particular partition type such as 
[MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
 which cast the partition type, will get a ClassCastException because the part 
param is actually a CheckpointRDDPartition.


> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Priority: Major
>
> When we run concurrent jobs using the same RDD which is marked for 
> checkpointing, one job may finish and start the RDD.doCheckpoint process 
> while another job is being submitted, in which case submitStage and 
> submitMissingTasks will be called. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, both 
> of which are affected by the checkpoint status. If the former is calculated 
> before doCheckpoint finishes while the latter is calculated after doCheckpoint 
> finishes, then when the task runs and rdd.compute is called, RDDs with a 
> particular partition type such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, will get a ClassCastException because the 
> part param is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that 
> called sc.runJob, while the task serialization occurs in the DAGScheduler's 
> event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-06 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353651#comment-16353651
 ] 

Eyal Farago commented on SPARK-19870:
-

We've also seen something very similar (stack traces attached) with Spark 
2.1.0. In our case this happened after several OOM conditions and executors 
being killed and restarted (a new AWS account was limited on the instance type 
we chose).

I've stared at these stack traces a lot and 'walked along' the block manager 
(and block info manager) code paths; my conclusion was that a thread leaks a 
writer lock for at least one block. Further investigation shows that tasks are 
registered for cleanup, which should eliminate the potential problem (the block 
info manager maintains maps of readers/writers per block).

However, there are other threads living inside a Spark executor, such as the 
BlockTransferService, that are not tasks and hence have no ID and no guaranteed 
cleanup...

I think I found a potential weak spot in the way Spark evicts blocks from 
memory (org.apache.spark.storage.memory.MemoryStore#evictBlocksToFreeSpace) 
that can keep holding locks in case of failure. SPARK-22083 seems to fix this, 
so I'd recommend trying out one of the versions including the fix (we're 
currently checking whether we can do so with the distros we're using).

 

[~irashid], you fixed SPARK-22083; do the scenarios referred to in this issue 
resemble what you were fixing? Does my analysis make sense to you?

 

attaching my stack traces:
{noformat}
Thread ID   Thread Name Thread StateThread Locks

148 Executor task launch worker for task 75639  WAITING 
Lock(java.util.concurrent.ThreadPoolExecutor$Worker@155076597})
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:450)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:636)
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:693)
org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
org.apache.spark.scheduler.Task.run(Task.scala:99)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

168 Executor task launch worker for task 75642  WAITING 
Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1540715596})
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:450)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:636)
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:693)
org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)

[jira] [Updated] (SPARK-23337) withWatermark raises an exception on struct objects

2018-02-06 Thread Aydin Kocas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aydin Kocas updated SPARK-23337:

Description: 
Hi,

 

when using a nested object (I mean an object within a struct, concretely: 
_source.createTime) from a json file as the parameter for the 
withWatermark method, I get an exception (see below).

Anything else works flawlessly with the nested object.

 

+*{color:#14892c}works:{color}*+ 
{code:java}
Dataset<Row> jsonRow = 
spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime",
 "10 seconds").toDF();{code}
 

json structure:
{code:java}
root
 |-- _id: string (nullable = true)
 |-- _index: string (nullable = true)
 |-- _score: long (nullable = true)
 |-- myTime: timestamp (nullable = true)
..{code}
+*{color:#d04437}does not work - nested json{color}:*+
{code:java}
Dataset<Row> jsonRow = 
spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime",
 "10 seconds").toDF();{code}
 

json structure:

 
{code:java}
root
 |-- _id: string (nullable = true)
 |-- _index: string (nullable = true)
 |-- _score: long (nullable = true)
 |-- _source: struct (nullable = true)
 | |-- createTime: timestamp (nullable = true)
..
 
Exception in thread "main" 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
'EventTimeWatermark '_source.createTime, interval 10 seconds
+- Deduplicate [_id#0], true
 +- StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true),
 StructField(_index,StringType,true), StructField(_score,LongType,true), 
StructField(_source,StructType(StructField(additionalData,StringType,true), 
StructField(client,StringType,true), 
StructField(clientDomain,BooleanType,true), 
StructField(clientVersion,StringType,true), 
StructField(country,StringType,true), StructField(countryName,StringType,true), 
StructField(createTime,TimestampType,true), 
StructField(externalIP,StringType,true), StructField(hostname,StringType,true), 
StructField(internalIP,StringType,true), StructField(location,StringType,true), 
StructField(locationDestination,StringType,true), 
StructField(login,StringType,true), 
StructField(originalRequestString,StringType,true), 
StructField(password,StringType,true), StructField(peerIdent,StringType,true), 
StructField(peerType,StringType,true), 
StructField(recievedTime,TimestampType,true), 
StructField(sessionEnd,StringType,true), 
StructField(sessionStart,StringType,true), 
StructField(sourceEntryAS,StringType,true), 
StructField(sourceEntryIp,StringType,true), 
StructField(sourceEntryPort,StringType,true), 
StructField(targetCountry,StringType,true), 
StructField(targetCountryName,StringType,true), 
StructField(targetEntryAS,StringType,true), 
StructField(targetEntryIp,StringType,true), 
StructField(targetEntryPort,StringType,true), 
StructField(targetport,StringType,true), StructField(username,StringType,true), 
StructField(vulnid,StringType,true)),true), 
StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), 
FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
 at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:796)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:674)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
 at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
 at scala.collection.immutable.List.foldLeft(List.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
 at 

[jira] [Resolved] (SPARK-23334) Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23334.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20507
[https://github.com/apache/spark/pull/20507]

> Fix pandas_udf with return type StringType() to handle str type properly in 
> Python 2.
> -
>
> Key: SPARK-23334
> URL: https://issues.apache.org/jira/browse/SPARK-23334
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> In Python 2, when pandas_udf tries to return a string type value created in 
> the udf with {{".."}}, the execution fails. E.g.,
> {code:java}
> from pyspark.sql.functions import pandas_udf, col
> import pandas as pd
> df = spark.range(10)
> str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
> df.select(str_f(col('id'))).show()
> {code}
> raises the following exception:
> {code}
> ...
> java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
> expected StringType, got BinaryType
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)
> ...
> {code}
> It seems that pyarrow ignores the {{type}} parameter of {{pa.Array.from_pandas()}} 
> and treats the data as binary type when the declared type is string but the 
> string values are {{str}} instead of {{unicode}} in Python 2.
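
For context, a minimal workaround sketch under Python 2 (assuming the fix is not yet applied) is to have the udf return {{unicode}} values, so that pyarrow infers a string array rather than a binary one. The session setup is illustrative only and assumes an existing SparkSession named {{spark}}:

{code:python}
# Hypothetical workaround sketch for Python 2: return unicode values so that
# pyarrow builds a string array instead of a binary array.
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)  # assumes an existing SparkSession named `spark`

# u"%s" yields unicode in Python 2, so the declared "string" return type matches.
str_f = pandas_udf(lambda x: pd.Series([u"%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
{code}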



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23334) Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23334:


Assignee: Takuya Ueshin

> Fix pandas_udf with return type StringType() to handle str type properly in 
> Python 2.
> -
>
> Key: SPARK-23334
> URL: https://issues.apache.org/jira/browse/SPARK-23334
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> In Python 2, when a pandas_udf tries to return a string-type value created in 
> the udf with {{".."}}, the execution fails. E.g.,
> {code:java}
> from pyspark.sql.functions import pandas_udf, col
> import pandas as pd
> df = spark.range(10)
> str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
> df.select(str_f(col('id'))).show()
> {code}
> raises the following exception:
> {code}
> ...
> java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
> expected StringType, got BinaryType
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)
> ...
> {code}
> It seems that pyarrow ignores the {{type}} parameter of {{pa.Array.from_pandas()}} 
> and treats the data as binary type when the declared type is string but the 
> string values are {{str}} instead of {{unicode}} in Python 2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23290.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20515
[https://github.com/apache/spark/pull/20515]

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Assignee: Takuya Ueshin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> The session above shows both the old behavior (returning an "object" column) 
> and the new behavior (returning a datetime column). Since there may be user 
> code relying on the old behavior, I'd suggest reverting this specific part of 
> the change. Also note that the NOTE in the docstring of 
> "_to_corrected_pandas_type" seems to be off, referring to the old behavior 
> rather than the current one.
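
As a side note, client code can stay agnostic to which behavior it receives by normalizing the column itself. A minimal sketch follows; the helper name {{to_date_objects}} is hypothetical, and the column name and sample row simply mirror the example above:

{code:python}
# Hypothetical defensive handling: normalize the `date` column to
# datetime.date objects whether toPandas() returned object values
# (old behavior) or a datetime64[ns] column (new behavior).
import pandas as pd

def to_date_objects(series):
    # pd.to_datetime handles both object and datetime64[ns] input;
    # .dt.date then yields plain datetime.date objects in either case.
    return pd.to_datetime(series).dt.date

pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
pdf['date'] = to_date_objects(pdf['date'])
print(pdf['date'].iloc[0])  # datetime.date(2015, 1, 1)
{code}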



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23290:


Assignee: Takuya Ueshin

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Assignee: Takuya Ueshin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> The session above shows both the old behavior (returning an "object" column) 
> and the new behavior (returning a datetime column). Since there may be user 
> code relying on the old behavior, I'd suggest reverting this specific part of 
> the change. Also note that the NOTE in the docstring of 
> "_to_corrected_pandas_type" seems to be off, referring to the old behavior 
> rather than the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23342) Add ORC configuration tests for ORC data source

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23342:


Assignee: Apache Spark

> Add ORC configuration tests for ORC data source
> ---
>
> Key: SPARK-23342
> URL: https://issues.apache.org/jira/browse/SPARK-23342
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue adds test coverage for ORC configuration with ORC names and Hive 
> names.
> *Example:*
> - orc.stripe.size, hive.exec.orc.default.stripe.size
> - orc.row.index.stride, hive.exec.orc.default.row.index.stride
> - orc.compress.size, hive.exec.orc.default.buffer.size
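
For illustration, a minimal PySpark sketch of passing such options to the ORC data source might look like the following; the option values and output path are assumptions for the example, not taken from the test suite:

{code:python}
# Hypothetical sketch: write ORC with explicit ORC option names; the
# corresponding hive.exec.orc.default.* names are expected to behave the same.
df = spark.range(100)  # assumes an existing SparkSession named `spark`

(df.write
   .option("orc.compress.size", "131072")     # ~ hive.exec.orc.default.buffer.size
   .option("orc.stripe.size", "67108864")     # ~ hive.exec.orc.default.stripe.size
   .option("orc.row.index.stride", "10000")   # ~ hive.exec.orc.default.row.index.stride
   .mode("overwrite")
   .orc("/tmp/orc_conf_example"))             # output path is illustrative
{code}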



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23342) Add ORC configuration tests for ORC data source

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353608#comment-16353608
 ] 

Apache Spark commented on SPARK-23342:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20517

> Add ORC configuration tests for ORC data source
> ---
>
> Key: SPARK-23342
> URL: https://issues.apache.org/jira/browse/SPARK-23342
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds test coverage for ORC configuration with ORC names and Hive 
> names.
> *Example:*
> - orc.stripe.size, hive.exec.orc.default.stripe.size
> - orc.row.index.stride, hive.exec.orc.default.row.index.stride
> - orc.compress.size, hive.exec.orc.default.buffer.size



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23342) Add ORC configuration tests for ORC data source

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23342:


Assignee: (was: Apache Spark)

> Add ORC configuration tests for ORC data source
> ---
>
> Key: SPARK-23342
> URL: https://issues.apache.org/jira/browse/SPARK-23342
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds test coverage for ORC configuration with ORC names and Hive 
> names.
> *Example:*
> - orc.stripe.size, hive.exec.orc.default.stripe.size
> - orc.row.index.stride, hive.exec.orc.default.row.index.stride
> - orc.compress.size, hive.exec.orc.default.buffer.size



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23343) Increase the exception test for the bind port

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23343:


Assignee: Apache Spark

> Increase the exception test for the bind port
> -
>
> Key: SPARK-23343
> URL: https://issues.apache.org/jira/browse/SPARK-23343
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Minor
>
> This PR adds new test cases:
> 1. add a boundary-value test for port 65535,
> 2. add a test for binding to a privileged port,
> 3. add a port-rebinding test with `spark.port.maxRetries` set to 1,
> 4. add a test that cycles through ports generated by `Utils.userPort`.
> In addition, in the existing test case, if `spark.testing` is not set to true, 
> the default value of `spark.port.maxRetries` is 16, not 100, so 
> (expectedPort + 100) is a small mistake.
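
To illustrate the retry behavior being tested, here is a rough Python sketch of binding with a bounded number of retries; it mirrors the idea behind `spark.port.maxRetries` and the 65535 boundary, not Spark's actual implementation, and the helper name and starting port are made up for the example:

{code:python}
# Rough illustration of port binding with a bounded number of retries,
# analogous in spirit to spark.port.maxRetries (not Spark's real code).
import socket

def bind_with_retries(start_port, max_retries):
    for attempt in range(max_retries + 1):
        port = start_port + attempt
        if port > 65535:
            break  # 65535 is the highest valid port, hence the boundary test
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            return sock, port
        except socket.error:
            sock.close()  # port in use or privileged; try the next one
    raise RuntimeError("could not bind after %d retries" % max_retries)

sock, port = bind_with_retries(1024, max_retries=16)  # 16 is the non-testing default
print("bound to port", port)
sock.close()
{code}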



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23343) Increase the exception test for the bind port

2018-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353604#comment-16353604
 ] 

Apache Spark commented on SPARK-23343:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20516

> Increase the exception test for the bind port
> -
>
> Key: SPARK-23343
> URL: https://issues.apache.org/jira/browse/SPARK-23343
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds new test cases:
> 1. add a boundary-value test for port 65535,
> 2. add a test for binding to a privileged port,
> 3. add a port-rebinding test with `spark.port.maxRetries` set to 1,
> 4. add a test that cycles through ports generated by `Utils.userPort`.
> In addition, in the existing test case, if `spark.testing` is not set to true, 
> the default value of `spark.port.maxRetries` is 16, not 100, so 
> (expectedPort + 100) is a small mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23343) Increase the exception test for the bind port

2018-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23343:


Assignee: (was: Apache Spark)

> Increase the exception test for the bind port
> -
>
> Key: SPARK-23343
> URL: https://issues.apache.org/jira/browse/SPARK-23343
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds new test cases:
> 1. add a boundary-value test for port 65535,
> 2. add a test for binding to a privileged port,
> 3. add a port-rebinding test with `spark.port.maxRetries` set to 1,
> 4. add a test that cycles through ports generated by `Utils.userPort`.
> In addition, in the existing test case, if `spark.testing` is not set to true, 
> the default value of `spark.port.maxRetries` is 16, not 100, so 
> (expectedPort + 100) is a small mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23343) Increase the exception test for the bind port

2018-02-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23343:
-

 Summary: Increase the exception test for the bind port
 Key: SPARK-23343
 URL: https://issues.apache.org/jira/browse/SPARK-23343
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: caoxuewen


This PR adds new test cases:
1. add a boundary-value test for port 65535,
2. add a test for binding to a privileged port,
3. add a port-rebinding test with `spark.port.maxRetries` set to 1,
4. add a test that cycles through ports generated by `Utils.userPort`.

In addition, in the existing test case, if `spark.testing` is not set to true, 
the default value of `spark.port.maxRetries` is 16, not 100, so 
(expectedPort + 100) is a small mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23342) Add ORC configuration tests for ORC data source

2018-02-06 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23342:
-

 Summary: Add ORC configuration tests for ORC data source
 Key: SPARK-23342
 URL: https://issues.apache.org/jira/browse/SPARK-23342
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


This issue adds test coverage for ORC configuration with ORC names and Hive 
names.

*Example:*
- orc.stripe.size, hive.exec.orc.default.stripe.size
- orc.row.index.stride, hive.exec.orc.default.row.index.stride
- orc.compress.size, hive.exec.orc.default.buffer.size





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org