[jira] [Commented] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203095#comment-16203095
 ] 

Takeshi Yamamuro commented on SPARK-22271:
--

You can attach a file via More -> Attach Files. By the way, a text file (CSV or something) would be better, I think.

> Describe results in "null" for the value of "mean" of a numeric variable
> 
>
> Key: SPARK-22271
> URL: https://issues.apache.org/jira/browse/SPARK-22271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: 
>Reporter: Shafique Jamal
>Priority: Minor
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +---++
> |summary| numericvariable|
> +---++
> |  count| 299|
> |   mean|null|
> | stddev|  0.2376438793946738|
> |min|0.037815489727642...|
> |max|2.138189366554511...|
> {noformat}
> But all of the rows for this seem ok (I can attach a parquet file). When I 
> round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +---+---+
> |summary|bround(numericvariable, 31)|
> +---+---+
> |  count|299|
> |   mean|   0.139522503183236...|
> | stddev| 0.2376438793946738|
> |min|   0.037815489727642...|
> |max|   2.138189366554511...|
> +---+---+
> {noformat}
> Rounding using 32 gives null also though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-12 Thread Shafique Jamal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203094#comment-16203094
 ] 

Shafique Jamal commented on SPARK-22271:


I'm happy to share the parquet file - how can I upload it? I don't see an 
option to upload a file (only to upload an image). Thanks!

> Describe results in "null" for the value of "mean" of a numeric variable
> 
>
> Key: SPARK-22271
> URL: https://issues.apache.org/jira/browse/SPARK-22271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: 
>Reporter: Shafique Jamal
>Priority: Minor
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +---++
> |summary| numericvariable|
> +---++
> |  count| 299|
> |   mean|null|
> | stddev|  0.2376438793946738|
> |min|0.037815489727642...|
> |max|2.138189366554511...|
> {noformat}
> But all of the rows for this seem ok (I can attach a parquet file). When I 
> round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +---+---+
> |summary|bround(numericvariable, 31)|
> +---+---+
> |  count|299|
> |   mean|   0.139522503183236...|
> | stddev| 0.2376438793946738|
> |min|   0.037815489727642...|
> |max|   2.138189366554511...|
> +---+---+
> {noformat}
> Rounding using 32 gives null also though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22266:


Assignee: (was: Apache Spark)

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
>
> We should avoid evaluating the same aggregate function more than once, as 
> stated in the code comment below (patterns.scala:206). However, things didn't 
> work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> where each HashAggregate contained two identical aggregate functions, 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22266:


Assignee: Apache Spark

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> We should avoid evaluating the same aggregate function more than once, as 
> stated in the code comment below (patterns.scala:206). However, things didn't 
> work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> where each HashAggregate contained two identical aggregate functions, 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203091#comment-16203091
 ] 

Apache Spark commented on SPARK-22266:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/19488

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
>
> We should avoid evaluating the same aggregate function more than once, as 
> stated in the code comment below (patterns.scala:206). However, things didn't 
> work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> where each HashAggregate contained two identical aggregate functions, 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22257) Reserve all non-deterministic expressions in ExpressionSet.

2017-10-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22257.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Reserve all non-deterministic expressions in ExpressionSet.
> ---
>
> Key: SPARK-22257
> URL: https://issues.apache.org/jira/browse/SPARK-22257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.3.0
>
>
> Non-deterministic expressions should be considered as not contained in the 
> [[ExpressionSet]].
>
> This is consistent with how we define `semanticEquals` between two 
> expressions.
> Otherwise, combining expressions will drop non-deterministic expressions that 
> should be preserved.
> E.g. combining the filters of
> ```
>   testRelation.where(Rand(0) > 0.1).where(Rand(0) > 0.1)
> ```
> should result in
> ```
>   testRelation.where(Rand(0) > 0.1 && Rand(0) > 0.1)
> ```
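
A DataFrame-level sketch of the same behaviour (assuming it runs in a spark-shell session where `spark` is available):

{code}
import org.apache.spark.sql.functions.rand

// Two separate non-deterministic filters. If the optimizer combines them,
// both rand(0) > 0.1 predicates should be preserved rather than being
// deduplicated away as semantically equal.
val filtered = spark.range(0, 100)
  .where(rand(0) > 0.1)
  .where(rand(0) > 0.1)

// The optimized plan should still contain both predicates.
filtered.explain(true)
{code}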



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22270) Renaming DF column breaks sparkPlan.outputOrdering

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203081#comment-16203081
 ] 

Takeshi Yamamuro commented on SPARK-22270:
--

This is probably a duplicate of 
https://issues.apache.org/jira/browse/SPARK-19981. Currently, aliases drop the 
output partitioning and ordering.

> Renaming DF column breaks sparkPlan.outputOrdering
> --
>
> Key: SPARK-22270
> URL: https://issues.apache.org/jira/browse/SPARK-22270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuri Bogomolov
>
> Renaming columns doesn't update ordering/distribution metadata. This may 
> cause unnecessary data shuffles, and significantly affect performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>+- Sort [id#0L ASC NULLS FIRST], true
>   +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>+- Sort [id#0L ASC NULLS FIRST], true
>   +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>+- *Project [id#0L AS id2#6L]
>   +- *Sort [id#0L ASC NULLS FIRST], true, 0
>  +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
> +- *Range (0, 10, step=1, splits=4)
> {code}
> You can see that the dataset is going to be sorted twice.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203080#comment-16203080
 ] 

Takeshi Yamamuro commented on SPARK-22271:
--

Could you give us the data and the schema, too?

> Describe results in "null" for the value of "mean" of a numeric variable
> 
>
> Key: SPARK-22271
> URL: https://issues.apache.org/jira/browse/SPARK-22271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: 
>Reporter: Shafique Jamal
>Priority: Minor
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +---++
> |summary| numericvariable|
> +---++
> |  count| 299|
> |   mean|null|
> | stddev|  0.2376438793946738|
> |min|0.037815489727642...|
> |max|2.138189366554511...|
> {noformat}
> But all of the rows for this seem ok (I can attach a parquet file). When I 
> round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +---+---+
> |summary|bround(numericvariable, 31)|
> +---+---+
> |  count|299|
> |   mean|   0.139522503183236...|
> | stddev| 0.2376438793946738|
> |min|   0.037815489727642...|
> |max|   2.138189366554511...|
> +---+---+
> {noformat}
> Rounding using 32 gives null also though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16060) Vectorized Orc reader

2017-10-12 Thread Rajiv Chodisetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203064#comment-16203064
 ] 

Rajiv Chodisetti edited comment on SPARK-16060 at 10/13/17 5:19 AM:


Will this fix the issue below as well, where reading the metadata of ORC files 
is very slow when the input data is stored on S3?

https://forums.databricks.com/questions/12499/parquet-vs-orc-s3-metadata-read-performance.html


was (Author: rajivchodisetti):
Will this fix the below issue as well , where reading ORC metadata reading is 
very slow when input data storage is S3 ?

https://forums.databricks.com/questions/12499/parquet-vs-orc-s3-metadata-read-performance.html

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently, the ORC reader in Spark SQL doesn't support vectorized reading. As 
> Hive's ORC reader already supports vectorization, we should add this support 
> to improve ORC reading performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-12 Thread Shafique Jamal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shafique Jamal updated SPARK-22271:
---
Description: 
Please excuse me if this issue was addressed already - I was unable to find it.

Calling .describe().show() on my dataframe results in a value of null for the 
row "mean":


{noformat}
val foo = spark.read.parquet("decimalNumbers.parquet")
foo.select(col("numericvariable")).describe().show()

foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
+---++
|summary| numericvariable|
+---++
|  count| 299|
|   mean|null|
| stddev|  0.2376438793946738|
|min|0.037815489727642...|
|max|2.138189366554511...|
{noformat}


But all of the rows for this seem ok (I can attach a parquet file). When I 
round the column, however, all is fine:


{noformat}
foo.select(bround(col("numericvariable"), 31)).describe().show()


+---+---+
|summary|bround(numericvariable, 31)|
+---+---+
|  count|299|
|   mean|   0.139522503183236...|
| stddev| 0.2376438793946738|
|min|   0.037815489727642...|
|max|   2.138189366554511...|
+---+---+

{noformat}

Rounding using 32 gives null also though.

  was:
Please excuse me if this issue was addressed already - I was unable to find it.

Calling .describe().show() on my dataframe results in a value of null for the 
row "mean":

{{val foo = spark.read.parquet("decimalNumbers.parquet")
foo.select(col("numericvariable")).describe().show()

foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
+---++
|summary| numericvariable|
+---++
|  count| 299|
|   mean|null|
| stddev|  0.2376438793946738|
|min|0.037815489727642...|
|max|2.138189366554511...|}}

But all of the rows for this seem ok (I can attache a parquet file). When I 
round the column, however, all is fine:

{{foo.select(bround(col("numericvariable"), 31)).describe().show()


+---+---+
|summary|bround(numericvariable, 31)|
+---+---+
|  count|299|
|   mean|   0.139522503183236...|
| stddev| 0.2376438793946738|
|min|   0.037815489727642...|
|max|   2.138189366554511...|
+---+---+}}

Rounding to 32 give null also though.


> Describe results in "null" for the value of "mean" of a numeric variable
> 
>
> Key: SPARK-22271
> URL: https://issues.apache.org/jira/browse/SPARK-22271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: 
>Reporter: Shafique Jamal
>Priority: Minor
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +---++
> |summary| numericvariable|
> +---++
> |  count| 299|
> |   mean|null|
> | stddev|  0.2376438793946738|
> |min|0.037815489727642...|
> |max|2.138189366554511...|
> {noformat}
> But all of the rows for this seem ok (I can attach a parquet file). When I 
> round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +---+---+
> |summary|bround(numericvariable, 31)|
> +---+---+
> |  count|299|
> |   mean|   0.139522503183236...|
> | stddev| 0.2376438793946738|
> |min|   0.037815489727642...|
> |max|   2.138189366554511...|
> +---+---+
> {noformat}
> Rounding using 32 gives null also though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16060) Vectorized Orc reader

2017-10-12 Thread Rajiv Chodisetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203064#comment-16203064
 ] 

Rajiv Chodisetti edited comment on SPARK-16060 at 10/13/17 5:18 AM:


Will this fix the issue below as well, where reading ORC metadata is very slow 
when the input data is stored on S3?

https://forums.databricks.com/questions/12499/parquet-vs-orc-s3-metadata-read-performance.html


was (Author: rajivchodisetti):
Will this fix, this issue the below issue as well , where reading ORC metadata 
reading is very slow when input data storage is S3 ?

https://forums.databricks.com/questions/12499/parquet-vs-orc-s3-metadata-read-performance.html

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently, the ORC reader in Spark SQL doesn't support vectorized reading. As 
> Hive's ORC reader already supports vectorization, we should add this support 
> to improve ORC reading performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16060) Vectorized Orc reader

2017-10-12 Thread Rajiv Chodisetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203064#comment-16203064
 ] 

Rajiv Chodisetti commented on SPARK-16060:
--

Will this fix the issue below as well, where reading ORC metadata is very slow 
when the input data is stored on S3?

https://forums.databricks.com/questions/12499/parquet-vs-orc-s3-metadata-read-performance.html

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently, the ORC reader in Spark SQL doesn't support vectorized reading. As 
> Hive's ORC reader already supports vectorization, we should add this support 
> to improve ORC reading performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-12 Thread Shafique Jamal (JIRA)
Shafique Jamal created SPARK-22271:
--

 Summary: Describe results in "null" for the value of "mean" of a 
numeric variable
 Key: SPARK-22271
 URL: https://issues.apache.org/jira/browse/SPARK-22271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: 

Reporter: Shafique Jamal
Priority: Minor


Please excuse me if this issue was addressed already - I was unable to find it.

Calling .describe().show() on my dataframe results in a value of null for the 
row "mean":

{{val foo = spark.read.parquet("decimalNumbers.parquet")
foo.select(col("numericvariable")).describe().show()

foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
+---++
|summary| numericvariable|
+---++
|  count| 299|
|   mean|null|
| stddev|  0.2376438793946738|
|min|0.037815489727642...|
|max|2.138189366554511...|}}

But all of the rows for this seem ok (I can attach a parquet file). When I 
round the column, however, all is fine:

{{foo.select(bround(col("numericvariable"), 31)).describe().show()


+---+---+
|summary|bround(numericvariable, 31)|
+---+---+
|  count|299|
|   mean|   0.139522503183236...|
| stddev| 0.2376438793946738|
|min|   0.037815489727642...|
|max|   2.138189366554511...|
+---+---+}}

Rounding to 32 gives null also though.
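
A self-contained sketch of the report (synthetic values of a similar scale standing in for the attached parquet file; this assumes the behaviour reproduces without the original data):

{code}
import org.apache.spark.sql.functions.{bround, col}
import org.apache.spark.sql.types.DecimalType

// Hypothetical stand-in for decimalNumbers.parquet: a decimal(38,32) column.
val foo = spark.range(1, 300)
  .select((col("id") / 140.0).cast(DecimalType(38, 32)).as("numericvariable"))

// mean comes back as null for the decimal(38,32) column ...
foo.describe("numericvariable").show()

// ... while rounding to 31 fractional digits yields a mean again.
foo.select(bround(col("numericvariable"), 31)).describe().show()
{code}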



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21165:

Fix Version/s: 2.3.0

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.0, 2.3.0
>
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
> at 
> 

[jira] [Updated] (SPARK-22252) FileFormatWriter should respect the input query schema

2017-10-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22252:

Fix Version/s: 2.2.1

> FileFormatWriter should respect the input query schema
> --
>
> Key: SPARK-22252
> URL: https://issues.apache.org/jira/browse/SPARK-22252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203042#comment-16203042
 ] 

Apache Spark commented on SPARK-16628:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19470

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to improve table 
> scanning performance, since at runtime the code path for scanning a 
> HadoopFsRelation is faster. However, OrcFileFormat's implementation assumes 
> that ORC files store their schema with correct column names, whereas before 
> Hive 2.0 an ORC table created by Hive does not store column names correctly 
> in the ORC files (HIVE-4243). So, for this kind of ORC dataset, we cannot 
> really convert the code path. 
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.
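
Until the conversion handles this case, a common mitigation sketch (assuming the affected table was written by Hive 1.x/0.x) is to disable the conversion so Spark falls back to the Hive SerDe path:

{code}
// Disable the automatic ORC conversion for this session.
// "legacy_orc_table" is a hypothetical table name.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.table("legacy_orc_table").show()
{code}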



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22270) Renaming DF column breaks sparkPlan.outputOrdering

2017-10-12 Thread Yuri Bogomolov (JIRA)
Yuri Bogomolov created SPARK-22270:
--

 Summary: Renaming DF column breaks sparkPlan.outputOrdering
 Key: SPARK-22270
 URL: https://issues.apache.org/jira/browse/SPARK-22270
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.0
Reporter: Yuri Bogomolov


Renaming columns doesn't update ordering/distribution metadata. This may cause 
unnecessary data shuffles, and significantly affect performance.

{code:java}
val df = spark.sqlContext.range(0, 10)
val sorted = df.sort("id")
val renamed = sorted.withColumnRenamed("id", "id2")
val sortedAgain = renamed.sort("id2")
sortedAgain.explain(true)

== Analyzed Logical Plan ==
id2: bigint
Sort [id2#6L ASC NULLS FIRST], true
+- Project [id#0L AS id2#6L]
   +- Sort [id#0L ASC NULLS FIRST], true
  +- Range (0, 10, step=1, splits=Some(4))

== Optimized Logical Plan ==
Sort [id2#6L ASC NULLS FIRST], true
+- Project [id#0L AS id2#6L]
   +- Sort [id#0L ASC NULLS FIRST], true
  +- Range (0, 10, step=1, splits=Some(4))

== Physical Plan ==
*Sort [id2#6L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
   +- *Project [id#0L AS id2#6L]
  +- *Sort [id#0L ASC NULLS FIRST], true, 0
 +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
+- *Range (0, 10, step=1, splits=4)
{code}

You can see that the dataset is going to be sorted twice.
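
A quick way to see the dropped metadata directly (sketch, run in a spark-shell session):

{code}
val sorted = spark.range(0, 10).sort("id")
val renamed = sorted.withColumnRenamed("id", "id2")

// The first plan reports its ordering in terms of id; after the rename the
// reported ordering is not expressed in terms of id2, so a later sort on id2
// cannot be elided and a second exchange is planned.
println(sorted.queryExecution.executedPlan.outputOrdering)
println(renamed.queryExecution.executedPlan.outputOrdering)
{code}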



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22263) Refactor deterministic as lazy value

2017-10-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22263:

Priority: Major  (was: Critical)

> Refactor deterministic as lazy value
> 
>
> Key: SPARK-22263
> URL: https://issues.apache.org/jira/browse/SPARK-22263
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
> Fix For: 2.3.0
>
>
> The method `deterministic` is frequently called in the optimizer.
> Refactor `deterministic` as a lazy value in order to avoid redundant 
> computation.
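
A simplified illustration of the refactor (not the actual `Expression` code):

{code}
trait Expr {
  def children: Seq[Expr]

  // Before: a def walks the whole child tree again on every optimizer call.
  //   def deterministic: Boolean = children.forall(_.deterministic)

  // After: a lazy val computes the answer once per expression instance and
  // caches it for all later calls.
  lazy val deterministic: Boolean = children.forall(_.deterministic)
}
{code}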



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22263) Refactor deterministic as lazy value

2017-10-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22263.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Refactor deterministic as lazy value
> 
>
> Key: SPARK-22263
> URL: https://issues.apache.org/jira/browse/SPARK-22263
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.3.0
>
>
> The method `deterministic` is frequently called in the optimizer.
> Refactor `deterministic` as a lazy value in order to avoid redundant 
> computation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202924#comment-16202924
 ] 

Reynold Xin edited comment on SPARK-20928 at 10/13/17 1:40 AM:
---

OK got it - you are basically saying if we can send the offset associated with 
each record (or a batch of records) to the sink, then the sink can potentially 
implement some sort of dedup to guarantee idempotency. For most sinks this 
probably won't work, but if a particular sink offers a way to do it, then 
end-to-end exactly once can be accomplished.




was (Author: rxin):
OK got it - you are basically saying if we can send the offset associated with 
each record to the sink, then the sink can potentially implement some sort of 
dedup to guarantee idempotency. For most sinks this probably won't work, but if 
a particular sink offers a way to do it, then end-to-end exactly once can be 
accomplished.



> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202924#comment-16202924
 ] 

Reynold Xin commented on SPARK-20928:
-

OK got it - you are basically saying if we can send the offset associated with 
each record to the sink, then the sink can potentially implement some sort of 
dedup to guarantee idempotency. For most sinks this probably won't work, but if 
a particular sink offers a way to do it, then end-to-end exactly once can be 
accomplished.



> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202923#comment-16202923
 ] 

Cody Koeninger commented on SPARK-20928:


If a given sink is handling a result, why does also handling the offset
corresponding to that result substantially increase overhead?

Thinking about it in terms of a downstream database, if I'm doing a
write per result, then the difference between writing (result) and
writing (result, offset) seems like it should be overshadowed by the
overall cost of the write.

In more practical terms, poll() on the kafka consumer is returning a
batch of pre-fetched messages anyway, not a single message, so one
should be able to run their straight line map/filter/whatever on the
batch and then commit results with the last offset.
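
A rough sketch of that idea (all names hypothetical; nothing here is an existing Spark API): commit each result together with its source offset and let the store ignore replays of offsets it has already seen.

{code}
final case class OffsetCommit(partition: Int, offset: Long, payload: String)

class DedupingStore {
  // Highest offset committed so far, per source partition.
  private val lastCommitted = scala.collection.mutable.Map.empty[Int, Long]

  // Writing (result, offset) instead of just (result) lets a replay after a
  // failure be detected and skipped, which is the idempotency argument above.
  def write(c: OffsetCommit): Unit = {
    if (lastCommitted.get(c.partition).forall(_ < c.offset)) {
      // ... perform the actual downstream write of c.payload here ...
      lastCommitted(c.partition) = c.offset
    }
  }
}
{code}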




> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine

2017-10-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202900#comment-16202900
 ] 

Saisai Shao commented on SPARK-9:
-

{quote}
I don't think that limited familiarity with a new promising feature is a good 
enough reason to avoid it. If every new feature will be treated this way, then 
new technologies will never get introduced to Spark.
{quote}

[~yuvaldeg] I think you might have misunderstood my point. I'm not saying that 
Spark will never introduce new technologies; my point is that if a technology 
is not only promising but also has a large audience, then of course we should 
bring it in, as with the k8s support. AFAIK, RDMA adoption is not yet common in 
the big data area.

Just my two cents.

> SPIP: RDMA Accelerated Shuffle Engine
> -
>
> Key: SPARK-9
> URL: https://issues.apache.org/jira/browse/SPARK-9
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuval Degani
> Attachments: 
> SPARK-9_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack as well as 
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
> directly by the actual Spark workloads, and help reduce the job runtime 
> significantly. 
> This performance gain is demonstrated with both the industry-standard HiBench 
> TeraSort benchmark (a 1.5x speedup in sorting) and shuffle-intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202906#comment-16202906
 ] 

Reynold Xin commented on SPARK-20928:
-

Isn't there an issue with the overhead of tracking in the sinks? If we need to 
guarantee exactly once, then each record needs to be committed.

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202905#comment-16202905
 ] 

Cody Koeninger commented on SPARK-20928:


I was talking about the specific case of jobs with only narrow stages.  If 
there's no shuffle, then it should be sufficient at any given point to record 
the per-partition offset alongside the result.

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user requested change in parallelism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22217) ParquetFileFormat to support arbitrary OutputCommitters

2017-10-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22217:
-
Fix Version/s: 2.2.1

> ParquetFileFormat to support arbitrary OutputCommitters
> ---
>
> Key: SPARK-22217
> URL: https://issues.apache.org/jira/browse/SPARK-22217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> Although you can choose which committer to use when writing dataframes as 
> parquet data via {{spark.sql.parquet.output.committer.class}}, you get a 
> class cast exception if the class is not 
> {{org.apache.parquet.hadoop.ParquetOutputCommitter}} or a subclass.
> This is not consistent with the docs in SQLConf, which say
> bq.  The specified class needs to be a subclass of 
> org.apache.hadoop.mapreduce.OutputCommitter.  Typically, it's also a subclass 
> of org.apache.parquet.hadoop.ParquetOutputCommitter.
> It is simple to relax {{ParquetFileFormat}}'s requirements, though if the 
> user has set {{parquet.enable.summary-metadata=true}} and set a committer 
> which is not a ParquetOutputCommitter, then they won't see the data.
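
For reference, a sketch of how the committer is selected today (the output path and DataFrame `df` are hypothetical; the class shown is the default, and any class that is not a ParquetOutputCommitter currently triggers the class cast exception described above):

{code}
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.parquet.hadoop.ParquetOutputCommitter")

df.write.parquet("/tmp/parquet-out")
{code}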



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202871#comment-16202871
 ] 

Takeshi Yamamuro edited comment on SPARK-22266 at 10/13/17 12:39 AM:
-

This is not a bug, so I changed it to an improvement. It seems it would be good 
to handle this kind of common expression in subexpressionElimination as well.


was (Author: maropu):
This is not a bug, so I changed to improvement.

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
>
> We should avoid evaluating the same aggregate function more than once, as 
> stated in the code comment below (patterns.scala:206). However, things didn't 
> work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> where each HashAggregate contained two identical aggregate functions, 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-22266:
-
Issue Type: Improvement  (was: Bug)

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
>
> We should avoid the same aggregate function being evaluated more than once, 
> and this is what has been stated in the code comment below 
> (patterns.scala:206). However things didn't work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> , where in each HashAggregate there were two identical aggregate functions 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202871#comment-16202871
 ] 

Takeshi Yamamuro commented on SPARK-22266:
--

This is not a bug, so I changed it to an improvement.

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
>
> We should avoid the same aggregate function being evaluated more than once, 
> and this is what has been stated in the code comment below 
> (patterns.scala:206). However things didn't work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> , where in each HashAggregate there were two identical aggregate functions 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202865#comment-16202865
 ] 

Apache Spark commented on SPARK-21549:
--

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/19487

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
> Fix For: 2.2.1, 2.3.0
>
>
> Spark fails to complete jobs correctly when custom OutputFormat 
> implementations are used.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all jobs that use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.
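
For illustration, a minimal sketch (Scala) of the kind of OutputFormat the report is about: one that ships records to an external system instead of a file system, so it never reads or sets mapreduce.output.fileoutputformat.outputdir. The class is illustrative only and not part of the linked code.

{code}
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, OutputFormat,
  RecordWriter, TaskAttemptContext}

// A no-op OutputFormat: records go to an external system (message queue,
// key-value store, ...) rather than HDFS, so no output directory is needed.
class ExternalSystemOutputFormat[K, V] extends OutputFormat[K, V] {

  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[K, V] =
    new RecordWriter[K, V] {
      override def write(key: K, value: V): Unit = {
        // e.g. push the record to the external system
      }
      override def close(context: TaskAttemptContext): Unit = {}
    }

  override def checkOutputSpecs(context: JobContext): Unit = {}

  override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter =
    new OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}
      override def abortTask(taskContext: TaskAttemptContext): Unit = {}
    }
}
{code}

A job that writes through such a format leaves the outputdir property unset, which is how the "Can not create a Path from a null string" exception above is reached.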



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202863#comment-16202863
 ] 

Takeshi Yamamuro commented on SPARK-22211:
--

Aha, I misunderstood. But I think case 3 is not acceptable (I know we can 
get big performance gains, though...) because the transformation of relational 
expressions must not change results. IIUC, in case 3 the push-down changes 
the result, right?
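
To make that concrete, a tiny sketch (Scala, assuming an active SparkSession named spark; the data is illustrative):

{code}
import spark.implicits._

val left  = Seq(1, 2, 3).toDF("id")
val right = Seq(1, 2, 3).toDF("id")

// Correct semantics: the limit applies to the joined result, so the single
// returned row is always a matched pair such as (1, 1).
val correct = left.join(right, left("id") === right("id"), "outer").limit(1)

// If LocalLimit(1) is instead pushed below the join on the left side, the left
// input shrinks to one arbitrary row; the remaining right rows then surface as
// (null, x) rows, and CollectLimit(1) may return one of those instead.
{code}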

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit is pushed to the left side, and id=999 
> is selected, but on the right side we have 100K rows including 999. The result 
> will be:
> - one row is (999, 999)
> - the rest of the rows are (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22217) ParquetFileFormat to support arbitrary OutputCommitters

2017-10-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22217:


Assignee: Steve Loughran

> ParquetFileFormat to support arbitrary OutputCommitters
> ---
>
> Key: SPARK-22217
> URL: https://issues.apache.org/jira/browse/SPARK-22217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.3.0
>
>
> Although you can choose which committer to use when writing dataframes as parquet 
> data via {{spark.sql.parquet.output.committer.class}}, you get a class cast 
> exception if this is not a 
> {{org.apache.parquet.hadoop.ParquetOutputCommitter}} or subclass.
> This is not consistent with the docs in SQLConf, which says
> bq.  The specified class needs to be a subclass of 
> org.apache.hadoop.mapreduce.OutputCommitter.  Typically, it's also a subclass 
> of org.apache.parquet.hadoop.ParquetOutputCommitter.
> It is simple to relax {{ParquetFileFormat}}'s requirements, though if the 
> user has set
> {{parquet.enable.summary-metadata=true}}, and set a committer which is not a 
> ParquetOutputCommitter, then they won't see the data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22217) ParquetFileFormat to support arbitrary OutputCommitters

2017-10-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22217.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19448
[https://github.com/apache/spark/pull/19448]

> ParquetFileFormat to support arbitrary OutputCommitters
> ---
>
> Key: SPARK-22217
> URL: https://issues.apache.org/jira/browse/SPARK-22217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Priority: Minor
> Fix For: 2.3.0
>
>
> Although you can choose which committer to use when writing dataframes as parquet 
> data via {{spark.sql.parquet.output.committer.class}}, you get a class cast 
> exception if this is not a 
> {{org.apache.parquet.hadoop.ParquetOutputCommitter}} or subclass.
> This is not consistent with the docs in SQLConf, which says
> bq.  The specified class needs to be a subclass of 
> org.apache.hadoop.mapreduce.OutputCommitter.  Typically, it's also a subclass 
> of org.apache.parquet.hadoop.ParquetOutputCommitter.
> It is simple to relax {{ParquetFileFormat}}'s requirements, though if the 
> user has set
> {{parquet.enable.summary-metadata=true}}, and set a committer which is not a 
> ParquetOutputCommitter, then they won't see the data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202795#comment-16202795
 ] 

Andrew Ash commented on SPARK-22269:


[~sowen] you closed this as a duplicate.  What issue is it a duplicate of?

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Minor
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22268) Fix java style errors

2017-10-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202793#comment-16202793
 ] 

Andrew Ash commented on SPARK-22268:


Any time {{./dev/run-tests}} fails, I consider that a bug.

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Trivial
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22269.
---
Resolution: Duplicate

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Minor
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22269:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Also not a bug. We've tried this before, and the problem is that it doesn't work 
with SBT.

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Minor
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22268) Fix java style errors

2017-10-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22268:
--
  Priority: Trivial  (was: Major)
Issue Type: Improvement  (was: Bug)

OK, don't make JIRAs for these though. These aren't bugs.

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Trivial
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202710#comment-16202710
 ] 

Marcelo Vanzin commented on SPARK-22269:


I've heard in the past that this wasn't desired because `lint-java` uses maven. 
Perhaps it's time to try something like 
https://github.com/etsy/sbt-checkstyle-plugin
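
A rough sketch of what that might look like; the plugin coordinates below follow the repository name and the version is a placeholder, so both should be checked against the plugin's README before use:

{code}
// project/plugins.sbt (version "x.y.z" is a placeholder)
addSbtPlugin("com.etsy" % "sbt-checkstyle-plugin" % "x.y.z")
{code}

Jenkins could then run the plugin's checkstyle task next to the existing Scala style checks.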

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22269) Java style checks should be run in Jenkins

2017-10-12 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22269:
--

 Summary: Java style checks should be run in Jenkins
 Key: SPARK-22269
 URL: https://issues.apache.org/jira/browse/SPARK-22269
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.0
Reporter: Andrew Ash


A few times now I've gone to build the master branch and it's failed due to 
Java style errors, which I've sent in PRs to fix:

- https://issues.apache.org/jira/browse/SPARK-22268
- https://issues.apache.org/jira/browse/SPARK-21875

Digging through the history a bit, it looks like this check used to run on 
Jenkins and was previously enabled at 
https://github.com/apache/spark/pull/10763 but then reverted at 
https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5

We should work out what it takes to enable the Java check in Jenkins so these 
kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22268) Fix java style errors

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22268:


Assignee: Apache Spark

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Assignee: Apache Spark
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22268) Fix java style errors

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22268:


Assignee: (was: Apache Spark)

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22268) Fix java style errors

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202697#comment-16202697
 ] 

Apache Spark commented on SPARK-22268:
--

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/19486

> Fix java style errors
> -
>
> Key: SPARK-22268
> URL: https://issues.apache.org/jira/browse/SPARK-22268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> {{./dev/lint-java}} fails on master right now with these exceptions:
> {noformat}
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
>  (sizes) LineLength: Line is longer than 100 characters (found 112).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
>  (sizes) LineLength: Line is longer than 100 characters (found 106).
> [ERROR] 
> src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
>  (sizes) LineLength: Line is longer than 100 characters (found 136).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
>  (sizes) LineLength: Line is longer than 100 characters (found 123).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
>  (sizes) LineLength: Line is longer than 100 characters (found 120).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
>  (sizes) LineLength: Line is longer than 100 characters (found 114).
> [ERROR] 
> src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
>  (sizes) LineLength: Line is longer than 100 characters (found 116).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
>  (imports) UnusedImports: Unused import - 
> org.apache.spark.sql.catalyst.expressions.Expression.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22268) Fix java style errors

2017-10-12 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22268:
--

 Summary: Fix java style errors
 Key: SPARK-22268
 URL: https://issues.apache.org/jira/browse/SPARK-22268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Andrew Ash


{{./dev/lint-java}} fails on master right now with these exceptions:

{noformat}
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176]
 (sizes) LineLength: Line is longer than 100 characters (found 112).
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177]
 (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178]
 (sizes) LineLength: Line is longer than 100 characters (found 136).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520]
 (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524]
 (sizes) LineLength: Line is longer than 100 characters (found 123).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533]
 (sizes) LineLength: Line is longer than 100 characters (found 120).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535]
 (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] 
src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182]
 (sizes) LineLength: Line is longer than 100 characters (found 116).
[ERROR] 
src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8]
 (imports) UnusedImports: Unused import - 
org.apache.spark.sql.catalyst.expressions.Expression.
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-12 Thread Arthur Baudry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202691#comment-16202691
 ] 

Arthur Baudry commented on SPARK-22240:
---

[~hyukjin.kwon] Yes, it is a single file, so even counting the # of files 
wouldn't work in this particular case.

Spark 2.0.2 has built-in support for multiline records without the option, so I guess 
having only one partition in Spark 2.2 is a kind of fail-safe mechanism, and we 
were just lucky never to have encountered any problems with our files when 
reading multiline records in Spark 2.0.2.

If it's any help, I also tried with s3n and it's the same thing. I didn't try with 
HDFS as I am only interacting with S3 at the moment. If I have a moment I shall 
try.

Thanks for your help

> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that the .option("multiline","true") is not supported in Spark 
> 2.0.2; it's not relevant here.
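
A possible workaround sketch (Scala), assuming the file contains no records that genuinely span multiple lines: without the multiLine option the CSV source can split the file, so the partition count is again derived from the file size. The path below is illustrative.

{code}
val input = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  // no .option("multiLine", "true"): with multiLine the file cannot be split
  // safely, so it is read as a single partition
  .load("s3a://bucket/path/file.csv")

println(input.rdd.getNumPartitions)
{code}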



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202621#comment-16202621
 ] 

Reynold Xin commented on SPARK-20928:
-

[~c...@koeninger.org] can you write down your thoughts on how we should 
maintain exactly once semantics?


> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20928:

Labels: SPIP  (was: )

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-10-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202547#comment-16202547
 ] 

Dongjoon Hyun commented on SPARK-21907:
---

Hi, [~hvanhovell] and [~eyalfa].
I added `2.2.1` to the fixed versions since it was merged into `branch-2.2` today.

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Eyal Farago
> Fix For: 2.2.1, 2.3.0
>
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at 

[jira] [Updated] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-10-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21907:
--
Fix Version/s: 2.2.1

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Eyal Farago
> Fix For: 2.2.1, 2.3.0
>
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Created] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-10-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22267:
-

 Summary: Spark SQL incorrectly reads ORC file when column order is 
different
 Key: SPARK-22267
 URL: https://issues.apache.org/jira/browse/SPARK-22267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.0, 2.0.2, 1.6.3
Reporter: Dongjoon Hyun


For a long time, Apache Spark SQL has returned incorrect results when the ORC file 
schema order is different from the metastore schema order.

{code}
scala> Seq(1 -> 2).toDF("c1", 
"c2").write.format("parquet").mode("overwrite").save("/tmp/p")
scala> Seq(1 -> 2).toDF("c1", 
"c2").write.format("orc").mode("overwrite").save("/tmp/o")
scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet LOCATION 
'/tmp/p'")
scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
'/tmp/o'")
scala> spark.table("p").show  // Parquet is good.
+---+---+
| c2| c1|
+---+---+
|  2|  1|
+---+---+
scala> spark.table("o").show// This is wrong.
+---+---+
| c2| c1|
+---+---+
|  1|  2|
+---+---+
scala> spark.read.orc("/tmp/o").show  // This is correct.
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
{code}
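
Until this is fixed, one possible workaround sketch is to bypass the metastore schema and read the ORC files directly, exposing them through a temporary view (the view name is illustrative):

{code}
// Reading the files directly uses the column order from the ORC files
// themselves, which returns correct values per the example above.
val o = spark.read.orc("/tmp/o")
o.createOrReplaceTempView("o_files")
spark.table("o_files").show()
{code}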



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202385#comment-16202385
 ] 

Hossein Falaki commented on SPARK-15799:


Congrats everyone. Thanks for the hard work on this.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine

2017-10-12 Thread Yuval Degani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202368#comment-16202368
 ] 

Yuval Degani commented on SPARK-9:
--

Good point [~jerryshao].
Regarding testing on machines without RDMA support:
For this exact reason, and also for cases where RDMA is used on a mixed 
cluster, where you may have both RDMA-capable and non-RDMA-capable machines, 
there is a software solution that is already part of the Linux kernel (version 
4.8+): "Soft-RoCE", aka "rxe".
Here are some links with more information:
https://elixir.free-electrons.com/linux/v4.8/source/drivers/infiniband/sw/rxe
https://community.mellanox.com/docs/DOC-2184
https://github.com/SoftRoCE/rxe-dev

Regarding your concern about maintaining the code:
I don't think that limited familiarity with a promising new feature is a good 
enough reason to avoid it. If every new feature were treated this way, new 
technologies would never get introduced to Spark.
For what it's worth, this is a project we take very seriously, and we will gladly 
commit to maintaining and supporting it.

> SPIP: RDMA Accelerated Shuffle Engine
> -
>
> Key: SPARK-9
> URL: https://issues.apache.org/jira/browse/SPARK-9
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuval Degani
> Attachments: 
> SPARK-9_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack as well as 
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
> directly by the actual Spark workloads, and help reduce the job runtime 
> significantly. 
> This performance gain is demonstrated with both industry standard HiBench 
> TeraSort (shows 1.5x speedup in sorting) as well as shuffle intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202360#comment-16202360
 ] 

Hyukjin Kwon commented on SPARK-15799:
--

Yay!

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202344#comment-16202344
 ] 

Dongjoon Hyun commented on SPARK-15799:
---

Great news, [~shivaram]!

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-12 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-22266:
---

 Summary: The same aggregate function was evaluated multiple times
 Key: SPARK-22266
 URL: https://issues.apache.org/jira/browse/SPARK-22266
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Maryann Xue
Priority: Minor


We should avoid evaluating the same aggregate function more than once, which is 
what the code comment below states (patterns.scala:206). However, things didn't 
work as expected.
{code}
  // A single aggregate expression might appear multiple times in 
resultExpressions.
  // In order to avoid evaluating an individual aggregate function multiple 
times, we'll
  // build a set of the distinct aggregate expressions and build a function 
which can
  // be used to re-write expressions so that they reference the single copy 
of the
  // aggregate function which actually gets computed.
{code}
For example, the physical plan of
{code}
SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
{code}
was
{code}
HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
+- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
   +- SerializeFromObject [assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
true]).b AS b#24]
  +- Scan ExternalRDDScan[obj#22]
{code}
Here, each HashAggregate contains two identical aggregate functions, 
"max(b#24 + 1)".

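As a self-contained sketch of the intended de-duplication (the {{AggExpr}} type below is a simplified stand-in, not Catalyst's {{AggregateExpression}}):
{code}
// Simplified illustration of the de-duplication described above, not the planner code.
case class AggExpr(sql: String)

// Keep a single copy of each distinct aggregate expression and build a rewrite map
// so that result expressions reference that single copy.
def dedup(resultAggs: Seq[AggExpr]): (Seq[AggExpr], Map[AggExpr, AggExpr]) = {
  val distinctAggs = resultAggs.distinct
  val rewrite = resultAggs.map(a => a -> distinctAggs.find(_ == a).get).toMap
  (distinctAggs, rewrite)
}

// For the example query, max(b + 1) should be planned once, not twice:
val (planned, _) = dedup(Seq(AggExpr("max(b + 1)"), AggExpr("max(b + 1)")))
assert(planned.size == 1)
{code}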


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-10-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202330#comment-16202330
 ] 

Kazuaki Ishizaki commented on SPARK-16845:
--

[This PR|https://github.com/apache/spark/pull/18972] allows us to execute the 
above program {{testExcept.scala}} without throwing an exception.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18350) Support session local timezone

2017-10-12 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202322#comment-16202322
 ] 

Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
-

Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
   .json(path)
   .withColumn("year", year($"_time"))
   .withColumn("month", month($"_time"))
   .withColumn("day", dayofmonth($"_time"))
   .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?


was (Author: hangleton):
Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
 .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (SPARK-18350) Support session local timezone

2017-10-12 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202322#comment-16202322
 ] 

Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
-

Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
.json(path)
.withColumn("year", year($"_time"))
.withColumn("month", month($"_time"))
.withColumn("day", dayofmonth($"_time"))
.withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
.json(path)
.withColumn("year", year($"_time"))
.withColumn("month", month($"_time"))
.withColumn("day", dayofmonth($"_time"))
.withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?


was (Author: hangleton):
Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
   .json(path)
   .withColumn("year", year($"_time"))
   .withColumn("month", month($"_time"))
   .withColumn("day", dayofmonth($"_time"))
   .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-18350) Support session local timezone

2017-10-12 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202322#comment-16202322
 ] 

Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:29 PM:
-

Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
 .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?


was (Author: hangleton):
Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> 

[jira] [Commented] (SPARK-18350) Support session local timezone

2017-10-12 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202322#comment-16202322
 ] 

Alexandre Dupriez commented on SPARK-18350:
---

Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new 
columns with the year, month, day and hour specified by the {{_time}} column, 
with something like:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row dependent - and let's call {{_tz}} the 
column which contains it. The timezone is at the row level, which is why I cannot 
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}
Having a look at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression, but this is a Spark-internal 
construct and the API forbids using it directly.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.

Do you think a somewhat elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e., should I not derive the hour from a timestamp 
column if it happens to rely on a non-predefined, row-dependent time zone like 
this?
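
For what it's worth, a possible workaround sketch while no built-in {{hour(t: Column, tz: Column)}} exists - this assumes {{_tz}} holds zone IDs such as "Europe/Paris", and reuses {{session}}, {{mySchema}} and {{path}} from the snippet above:
{code:java}
// Hedged workaround sketch, not a proposed API: derive the local hour per row
// with a UDF, using the zone ID stored in the _tz column.
import java.sql.Timestamp
import java.time.ZoneId
import org.apache.spark.sql.functions.udf
import session.implicits._   // for the $"..." column syntax

val hourInZone = udf { (ts: Timestamp, tz: String) =>
  ts.toInstant.atZone(ZoneId.of(tz)).getHour
}

session.read.schema(mySchema)
  .json(path)
  .withColumn("hour", hourInZone($"_time", $"_tz"))
{code}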

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-12 Thread Benyi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202278#comment-16202278
 ] 

Benyi Wang commented on SPARK-22211:


I think my suggested solution is correct.

|| Case || Left join key || Right join key || Full outer join ||
| 1 | Y | N | {{(left(\*), null)}} |
| 2 | Y | Y | {{(left(\*), right(\*))}} |
| 3 | N | Y | {{(null, right(\*))}} |
| 4 | N | N | Not applied |

If LimitPushDown pushes the limit to the left side, then no matter what the limit 
value is or how big the left-side table is, some rows will always be selected. In 
other words, the join keys always exist, only cases 1 and 2 can happen, and the 
result is effectively a left join. It is equivalent to a right join when pushing 
down to the right side.

The only problem with this method is that case 3 never shows up when pushing down 
to the left side, and likewise case 1 for the right side. I would say this is not 
a big issue because we just want to see some sample rows of the join result, and 
the benefit is huge. If we want to see left-only or right-only rows, we can add a 
where clause.
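
To make this concrete, a hedged sketch (reusing {{dl}} and {{dr}} from the notebook snippet quoted below; this is not the LimitPushDown rule itself):
{code}
// Sketch of the behaviour argued above: once the limit is applied to the left
// side, every surviving row carries its join key, so only cases 1 and 2 from the
// table can occur - i.e. the result is effectively a left outer join.
val sampledLeft = dl.limit(1)
val effectivelyLeftJoin =
  sampledLeft.join(dr, sampledLeft("id") === dr("id"), "left_outer")
effectivelyLeftJoin.show(false)   // the sampled id appears, matched or with nulls on the right
{code}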

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 
> is selected, but at right side we have 100K rows including 999, the result 
> will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/10th chance to be 
> selected by CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202260#comment-16202260
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

This is now live! https://cran.r-project.org/web/packages/SparkR/

Marking this issue as resolved.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15799.
---
   Resolution: Fixed
 Assignee: Shivaram Venkataraman
Fix Version/s: 2.1.2

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202174#comment-16202174
 ] 

Apache Spark commented on SPARK-20055:
--

User 'jomach' has created a pull request for this issue:
https://github.com/apache/spark/pull/19485

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both, in particular ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the types and column names of the data.
> In that sense, I think we had better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point to each API 
> documentation
> - Fix trivial minor issues together in both sections.
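
As a hedged illustration of the kind of snippet such a CSV section could include (the file path and options below are placeholders, not taken from the issue):
{code}
// Illustrative only: a typical CSV read using the options called out above.
// "people.csv" is a hypothetical file whose first line is a header row.
val peopleDF = spark.read
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // infer column types instead of reading strings
  .csv("examples/src/main/resources/people.csv")

peopleDF.printSchema()
{code}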



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22265) pyspark can't erialization object

2017-10-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22265.
---
Resolution: Invalid

Please ask questions on the mailing list or StackOverflow, though you'll need 
to provide a lot more info and background research if you want help.

> pyspark can't erialization object 
> --
>
> Key: SPARK-22265
> URL: https://issues.apache.org/jira/browse/SPARK-22265
> Project: Spark
>  Issue Type: IT Help
>  Components: DStreams, PySpark
>Affects Versions: 2.2.0
>Reporter: bianxiaokun
>
> When creating a Kafka producer, it cannot be serialized.
> How can I handle this? Please help me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22265) pyspark can't erialization object

2017-10-12 Thread bianxiaokun (JIRA)
bianxiaokun created SPARK-22265:
---

 Summary: pyspark can't erialization object 
 Key: SPARK-22265
 URL: https://issues.apache.org/jira/browse/SPARK-22265
 Project: Spark
  Issue Type: IT Help
  Components: DStreams, PySpark
Affects Versions: 2.2.0
Reporter: bianxiaokun


When creating a Kafka producer, it cannot be serialized.
How can I handle this? Please help me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22248) spark marks all columns as null when its unable to parse single column

2017-10-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202116#comment-16202116
 ] 

Hyukjin Kwon commented on SPARK-22248:
--

I think either way breaks existing support. JSON support started in Spark 1.4.0 
and CSV support started as a third-party package. We should match them, but I am 
not sure which one should be treated as the bug to fix. And, yeah, I guess it'd 
need a careful look. It'd be more complex if we support SPARK-20990, BTW.
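
For reference, a hedged sketch of one way to see which records were treated as malformed when a user schema is supplied: include a corrupt-record column in the schema and point {{columnNameOfCorruptRecord}} at it (the column name and file path below are assumptions).
{code}
// Sketch only: with an explicit schema, PERMISSIVE mode keeps the raw text of
// records that failed to parse if the schema contains the corrupt-record column.
import org.apache.spark.sql.types._

val schemaWithCorrupt = StructType(
  StructField("name", StringType) ::
  StructField("count", LongType) ::
  StructField("_corrupt_record", StringType) :: Nil)

spark.read
  .schema(schemaWithCorrupt)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/events.json")   // hypothetical path
  .show(false)
{code}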

> spark marks all columns as null when its unable to parse single column
> --
>
> Key: SPARK-22248
> URL: https://issues.apache.org/jira/browse/SPARK-22248
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Vignesh Mohan
>
> when parsing JSON data in `PERMISSIVE` mode, if one column mismatches the 
> schema, all column values are set to null.
> {code}
> val conf = new SparkConf().setMaster("local").setAppName("app")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val sparkschema : StructType = {
>   StructType(StructField("name", StringType) :: StructField("count", 
> LongType) :: Nil)
> }
>   val rdd = sc.parallelize(List(
> """
>   |{"name": "foo", "count": 24.0}}
>   |""".stripMargin,
> """
>   |{"name": "bar", "count": 24}}
>   |""".stripMargin
>   ))
>   
> sqlContext.read.schema(sparkschema).json(rdd).createOrReplaceTempView("events")
>   sqlContext.sql(
> """
>   | select
>   | name,count
>   | from
>   | events
> """.stripMargin).collect.foreach(println)
> {code}
> Output:
> {code}
> 17/10/11 03:12:04 WARN JacksonParser: Found at least one malformed records 
> (sample: 
> {"name": "foo", "count": 24.0}}
> ). The JSON reader will replace
> all malformed records with placeholder null in current PERMISSIVE parser mode.
> To find out which corrupted records have been replaced with null, please use 
> the
> default inferred schema instead of providing a custom schema.
> Code example to print all malformed records (scala):
> ===
> // The corrupted record exists in column _corrupt_record.
> val parsedJson = spark.read.json("/path/to/json/file/test.json")
>
> [null,null]
> [bar,24]
> {code}
> Expected output:
> {code}
> [foo,null]
> [bar,24]
> {code}
> The problem comes from 
> `spark-catalyst_2.11-2.1.0-sources.jar!/org/apache/spark/sql/catalyst/json/JacksonParser.scala`
> {code}
> private def failedConversion(
>   parser: JsonParser,
>   dataType: DataType): PartialFunction[JsonToken, Any] = {
> case VALUE_STRING if parser.getTextLength < 1 =>
>   // If conversion is failed, this produces `null` rather than throwing 
> exception.
>   // This will protect the mismatch of types.
>   null
> case token =>
>   // We cannot parse this token based on the given data type. So, we 
> throw a
>   // SparkSQLJsonProcessingException and this exception will be caught by
>   // `parse` method.
>   throw new SparkSQLJsonProcessingException(
> s"Failed to parse a value for data type $dataType (current token: 
> $token).")
>   }
> {code}
> this raises an exception when parsing the column and 
> {code}
> def parse(input: String): Seq[InternalRow] = {
> if (input.trim.isEmpty) {
>   Nil
> } else {
>   try {
> Utils.tryWithResource(factory.createParser(input)) { parser =>
>   parser.nextToken()
>   rootConverter.apply(parser) match {
> case null => failedRecord(input)
> case row: InternalRow => row :: Nil
> case array: ArrayData =>
>   // Here, as we support reading top level JSON arrays and take 
> every element
>   // in such an array as a row, this case is possible.
>   if (array.numElements() == 0) {
> Nil
>   } else {
> array.toArray[InternalRow](schema)
>   }
> case _ =>
>   failedRecord(input)
>   }
> }
>   } catch {
> case _: JsonProcessingException =>
>   failedRecord(input)
> case _: SparkSQLJsonProcessingException =>
>   failedRecord(input)
>   }
> }
>   }
> {code}
> marks the whole record as failedRecord. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22164) support histogram in estimating the cardinality of aggregate (or group-by) operator

2017-10-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202081#comment-16202081
 ] 

Zhenhua Wang edited comment on SPARK-22164 at 10/12/17 3:21 PM:


[~ron8hu] I don't think we will use the histogram in group-by estimation; ndv can 
be updated through propagation based on the histogram (if any), and we only need 
ndv here. Let's close this?
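
For context, a hedged sketch (simplified, not the actual CBO code) of the ndv-based group-by estimate referred to here: the output row count is the product of the grouping columns' distinct counts, capped by the child's row count.
{code}
// Simplified illustration, not Spark's implementation: estimate group-by output
// cardinality from per-column ndv values, never exceeding the input row count.
def estimateGroupByRows(childRows: BigInt, groupColumnNdvs: Seq[BigInt]): BigInt =
  groupColumnNdvs.foldLeft(BigInt(1))(_ * _).min(childRows)

// Example: 1,000,000 input rows grouped by columns with 100 and 50 distinct values.
estimateGroupByRows(BigInt(1000000), Seq(BigInt(100), BigInt(50)))   // = 5000
{code}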


was (Author: zenwzh):
[~ron8hu] I don't think the histogram can help with group-by; let's close this?

> support histogram in estimating the cardinality of aggregate (or group-by) 
> operator
> ---
>
> Key: SPARK-22164
> URL: https://issues.apache.org/jira/browse/SPARK-22164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ron Hu
>
> Histogram is effective in dealing with skewed distribution. After we generate 
> histogram information for column statistics, we need to adjust aggregate (or 
> group-by) cardinality estimation based on equi-height histogram information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22164) support histogram in estimating the cardinality of aggregate (or group-by) operator

2017-10-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang resolved SPARK-22164.
--
  Resolution: Won't Fix
Target Version/s:   (was: 2.3.0)

> support histogram in estimating the cardinality of aggregate (or group-by) 
> operator
> ---
>
> Key: SPARK-22164
> URL: https://issues.apache.org/jira/browse/SPARK-22164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ron Hu
>
> Histogram is effective in dealing with skewed distribution. After we generate 
> histogram information for column statistics, we need to adjust aggregate (or 
> group-by) cardinality estimation based on equi-height histogram information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22164) support histogram in estimating the cardinality of aggregate (or group-by) operator

2017-10-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang closed SPARK-22164.


> support histogram in estimating the cardinality of aggregate (or group-by) 
> operator
> ---
>
> Key: SPARK-22164
> URL: https://issues.apache.org/jira/browse/SPARK-22164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ron Hu
>
> Histogram is effective in dealing with skewed distribution. After we generate 
> histogram information for column statistics, we need to adjust aggregate (or 
> group-by) cardinality estimation based on equi-height histogram information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22164) support histogram in estimating the cardinality of aggregate (or group-by) operator

2017-10-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202081#comment-16202081
 ] 

Zhenhua Wang commented on SPARK-22164:
--

[~ron8hu] I don't think the histogram can help with group-by; let's close this?

> support histogram in estimating the cardinality of aggregate (or group-by) 
> operator
> ---
>
> Key: SPARK-22164
> URL: https://issues.apache.org/jira/browse/SPARK-22164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ron Hu
>
> Histogram is effective in dealing with skewed distribution. After we generate 
> histogram information for column statistics, we need to adjust aggregate (or 
> group-by) cardinality estimation based on equi-height histogram information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22251) Metric "aggregate time" is incorrect when codegen is off

2017-10-12 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-22251.
---
   Resolution: Fixed
 Assignee: Ala Luszczak
Fix Version/s: 2.3.0

> Metric "aggregate time" is incorrect when codegen is off
> 
>
> Key: SPARK-22251
> URL: https://issues.apache.org/jira/browse/SPARK-22251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>Assignee: Ala Luszczak
>Priority: Minor
> Fix For: 2.3.0
>
>
> When whole-stage codegen is off, metric "aggregate time" is not set correctly.
> Repro:
> # spark.conf.set("spark.sql.codegen.wholeStage", false)
> # 
> spark.range(5).crossJoin(spark.range(5)).toDF("a","b").groupBy("a").agg(sum("b")).show
> # In Spark UI > SQL you can see that "aggregate time total" is 0 in 
> "HashAggregate" boxes



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22259) hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]

2017-10-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202071#comment-16202071
 ] 

Hyukjin Kwon commented on SPARK-22259:
--

Looks like a dupe of SPARK-22260, BTW.

>  hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. 
> expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]
> --
>
> Key: SPARK-22259
> URL: https://issues.apache.org/jira/browse/SPARK-22259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: spark1.6.1,YARN2.6.0
>Reporter: Liu Dinghua
>
> my code with the errors is as follows:
> sqlContext.read.parquet("hdfs://HdfsHA/logrep/1/sspstatistic/")
> java.io.IOException: Could not read footer: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 
> 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22259) hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]

2017-10-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22259.
--
Resolution: Duplicate

>  hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. 
> expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]
> --
>
> Key: SPARK-22259
> URL: https://issues.apache.org/jira/browse/SPARK-22259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: spark1.6.1,YARN2.6.0
>Reporter: Liu Dinghua
>
> my code with the errors is as follows:
> sqlContext.read.parquet("hdfs://HdfsHA/logrep/1/sspstatistic/")
> java.io.IOException: Could not read footer: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 
> 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22260) java.lang.RuntimeException: hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too small)

2017-10-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202070#comment-16202070
 ] 

Hyukjin Kwon commented on SPARK-22260:
--

Does this happen randomly? Can you provide a reproducer and/or try this out in 
a higher version?

>  java.lang.RuntimeException: hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is 
> not a Parquet file (too small)
> 
>
> Key: SPARK-22260
> URL: https://issues.apache.org/jira/browse/SPARK-22260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: spark1.6.1,YARN2.6.0
>Reporter: Liu Dinghua
>
> the code which encountered the errors is as follows:
> *   sqlContext.read.parquet("/logrep/1/sspstatistic")*
> the detailed errors are as follows:
> 17/10/12 10:41:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 
> (TID 3, slave05): java.io.IOException: Could not read footer: 
> java.lang.RuntimeException: hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is 
> not a Parquet file (too small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 
> 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22259) hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]

2017-10-12 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202052#comment-16202052
 ] 

Kaushal Prajapati commented on SPARK-22259:
---

Seems like the file you are using is not a Parquet file.

>  hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. 
> expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]
> --
>
> Key: SPARK-22259
> URL: https://issues.apache.org/jira/browse/SPARK-22259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: spark1.6.1,YARN2.6.0
>Reporter: Liu Dinghua
>
> my code with the errors is as follows:
> sqlContext.read.parquet("hdfs://HdfsHA/logrep/1/sspstatistic/")
> java.io.IOException: Could not read footer: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 
> 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2017-10-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202016#comment-16202016
 ] 

Maciej Bryński commented on SPARK-20712:


After the ALTER I needed to recreate all the type strings in Hive
(using a Python script).

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> 

[jira] [Updated] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2017-10-12 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-20712:
---
Affects Version/s: 2.2.0

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> 

[jira] [Commented] (SPARK-22252) FileFormatWriter should respect the input query schema

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202004#comment-16202004
 ] 

Apache Spark commented on SPARK-22252:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19484

> FileFormatWriter should respect the input query schema
> --
>
> Key: SPARK-22252
> URL: https://issues.apache.org/jira/browse/SPARK-22252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201988#comment-16201988
 ] 

Apache Spark commented on SPARK-21165:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19483

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.0
>
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
>  

[jira] [Commented] (SPARK-22247) Hive partition filter very slow

2017-10-12 Thread Noam Asor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201971#comment-16201971
 ] 

Noam Asor commented on SPARK-22247:
---

Maybe this issue is related to SPARK-17992.

> Hive partition filter very slow
> ---
>
> Key: SPARK-22247
> URL: https://issues.apache.org/jira/browse/SPARK-22247
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Patrick Duin
>Priority: Minor
>
> I found an issue where filtering partitions using a dataframe results in very 
> bad performance.
> To reproduce:
> Create a hive table with a lot of partitions and write a spark query on that 
> table that filters based on the partition column.
> In my use case I've got a table with about 30k partitions. 
> I filter the partitions using some scala via spark-shell:
> {{table.filter("partition=x or partition=y")}}
> This results in a Hive Thrift API call {{#get_partitions('db', 'table', 
> -1)}}, which is very slow (minutes) and loads all metastore partitions into 
> memory.
> Doing a simpler filter:
> {{table.filter("partition=x")}} 
> results in a Hive Thrift API call {{#get_partitions_by_filter('db', 'table', 
> 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches 
> partition X into memory.
> If possible, Spark should translate the filter into the more performant Thrift 
> call, or fall back to a more scalable solution where it filters out partitions 
> without having to load them all into memory first (for instance fetching 
> the partitions in batches; a sketch of the fast per-partition form follows below).
> I've posted my original question on 
> [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions]
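
As a hedged illustration of the fast path described above (not a proposed fix): issue 
one simple equality filter per wanted partition value, each of which should map to the 
cheap {{get_partitions_by_filter}} call, and union the results. {{table}}, the column 
name and the values are placeholders taken from the description; whether the pruning 
survives the union is worth confirming against the physical plan.

{code}
// Hedged sketch: one equality filter per partition value, then union.
import org.apache.spark.sql.functions.col

val wanted = Seq("x", "y")                        // partition values of interest
val pruned = wanted
  .map(v => table.filter(col("partition") === v)) // each should hit get_partitions_by_filter
  .reduce(_ union _)
pruned.count()
{code}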



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine

2017-10-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201950#comment-16201950
 ] 

Saisai Shao commented on SPARK-22229:
-

My concern is how to maintain this code in the community. RDMA is not as well known as 
Netty/socket programming, so I'm not sure there are many devs who understand it and can 
fully leverage it; also, how do we test it on a commodity machine without RDMA support? 
I'm afraid that if the code is seldom used and maintained, it will gradually become 
obsolete and buggy.

> SPIP: RDMA Accelerated Shuffle Engine
> -
>
> Key: SPARK-22229
> URL: https://issues.apache.org/jira/browse/SPARK-22229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuval Degani
> Attachments: 
> SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack as well as 
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
> directly by the actual Spark workloads, and help reduce the job runtime 
> significantly. 
> This performance gain is demonstrated with both the industry-standard HiBench 
> TeraSort (which shows a 1.5x speedup in sorting) and shuffle-intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-10-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201917#comment-16201917
 ] 

Steve Loughran commented on SPARK-21797:


Update: in HADOOP-14874 I've noted we could use the existing 
{{FileSystem.getContentSummary(Path)}} API to return {{StorageType.ARCHIVE}} 
for glaciated data. You'd need a way of filtering the listing of source files 
to strip out everything of archive type, but then yes, you could skip data in 
Glacier.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are 
> in Glacier. I could always read each date specifically, add the column with the 
> current date and reduce(_ union _) at the end, but that is not pretty and it 
> should not be necessary (a sketch of that workaround follows below).
> Is there any tip for reading the available data in the datastore even with old 
> data in Glacier?
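
A minimal sketch of the per-date workaround mentioned above, assuming the dates outside 
Glacier are known in advance; the bucket, the dates and the column name are placeholders, 
and the dt column is re-added because it is normally derived from the partition directory:

{code}
// Hedged workaround sketch: read only the non-Glacier dates and union them.
import org.apache.spark.sql.functions.lit

val availableDates = Seq("2017-07-15", "2017-07-16", "2017-07-17")
val X = availableDates
  .map(dt => spark.read.parquet(s"s3://my-bucket/my-dataset/dt=$dt")
                  .withColumn("dt", lit(dt)))
  .reduce(_ union _)
{code}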



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22197) push down operators to data source before planning

2017-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22197.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19424
[https://github.com/apache/spark/pull/19424]

> push down operators to data source before planning
> --
>
> Key: SPARK-22197
> URL: https://issues.apache.org/jira/browse/SPARK-22197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22097) Request an accurate memory after we unrolled the block

2017-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22097:

Affects Version/s: (was: 2.2.0)
   2.3.0

> Request an accurate memory after we unrolled the block
> --
>
> Key: SPARK-22097
> URL: https://issues.apache.org/jira/browse/SPARK-22097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianyang Liu
> Fix For: 2.3.0
>
>
> We only need request bbos.size - unrollMemoryUsedByThisBlock after unrolled 
> the block.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22097) Request an accurate memory after we unrolled the block

2017-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22097.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19316
[https://github.com/apache/spark/pull/19316]

> Request an accurate memory after we unrolled the block
> --
>
> Key: SPARK-22097
> URL: https://issues.apache.org/jira/browse/SPARK-22097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
> Fix For: 2.3.0
>
>
> We only need request bbos.size - unrollMemoryUsedByThisBlock after unrolled 
> the block.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22242) streaming job failed to restart from checkpoint

2017-10-12 Thread StephenZou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201868#comment-16201868
 ] 

StephenZou commented on SPARK-22242:


Perhaps; please don't forget to add spark.yarn.jars if it is fixed there. 

> streaming job failed to restart from checkpoint
> ---
>
> Key: SPARK-22242
> URL: https://issues.apache.org/jira/browse/SPARK-22242
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0, 2.2.0
>Reporter: StephenZou
>
> My spark-defaults.conf has an item related to the issue; I uploaded all jars in 
> Spark's jars folder to the HDFS path:
> spark.yarn.jars  hdfs:///spark/cache/spark2.2/* 
> The streaming job fails to restart from the checkpoint; the ApplicationMaster throws 
> "Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher".  The problem is always 
> reproducible.
> I examined the SparkConf object recovered from the checkpoint and found that 
> spark.yarn.jars is set to empty, which leaves all the jars missing on the AM side. 
> The solution is that spark.yarn.jars should be reloaded from the properties file 
> when recovering from the checkpoint. 
> Attached is a demo to reproduce the issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201863#comment-16201863
 ] 

Steve Loughran commented on SPARK-22240:


Thanks. Now for a question which is probably obvious to others:

Is there a straightforward way to determine the number of partitions a file is 
being broken into other than doing a job & counting the # of part- files? I 
guess even if there is, counting the files is straightforward (see the aside below). 

If it's multiline related, do you think this would apply to all filesystems, or 
is it possible that s3a is making things worse?

Anyway, whatever happens here, we'll get a fix for s3a's getBlockLocations 
shortly.
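
For what it's worth, one lightweight check (grounded in the repro quoted below, not a 
new API) is to build the DataFrame and inspect the partition count before running any 
job; the path here is a placeholder:

{code}
// Hedged aside: inspect the planned number of input partitions without running a job.
val input = spark.read.format("csv")
  .option("header", "true").option("delimiter", "|").option("multiLine", "true")
  .load("s3a://some-bucket/some-file.csv")
println(input.rdd.getNumPartitions)
{code}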

> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 
> 2.0.2; that's not relevant here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22252) FileFormatWriter should respect the input query schema

2017-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22252.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19474
[https://github.com/apache/spark/pull/19474]

> FileFormatWriter should respect the input query schema
> --
>
> Key: SPARK-22252
> URL: https://issues.apache.org/jira/browse/SPARK-22252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21646) Add new type coercion rules to compatible with Hive

2017-10-12 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21646:

Attachment: Type_coercion_rules_to_compatible_with_Hive.pdf

> Add new type coercion rules to compatible with Hive
> ---
>
> Key: SPARK-21646
> URL: https://issues.apache.org/jira/browse/SPARK-21646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
> Attachments: Type_coercion_rules_to_compatible_with_Hive.pdf
>
>
> How to reproduce:
> hive:
> {code:sql}
> $ hive -S
> hive> create table spark_21646(c1 string, c2 string);
> hive> insert into spark_21646 values('92233720368547758071', 'a');
> hive> insert into spark_21646 values('21474836471', 'b');
> hive> insert into spark_21646 values('10', 'c');
> hive> select * from spark_21646 where c1 > 0;
> 92233720368547758071  a
> 10c
> 21474836471   b
> hive>
> {code}
> spark-sql:
> {code:sql}
> $ spark-sql -S
> spark-sql> select * from spark_21646 where c1 > 0;
> 10  c 
>   
> spark-sql> select * from spark_21646 where c1 > 0L;
> 21474836471   b
> 10c
> spark-sql> explain select * from spark_21646 where c1 > 0;
> == Physical Plan ==
> *Project [c1#14, c2#15]
> +- *Filter (isnotnull(c1#14) && (cast(c1#14 as int) > 0))
>+- *FileScan parquet spark_21646[c1#14,c2#15] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[viewfs://cluster4/user/hive/warehouse/spark_21646], 
> PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct
> spark-sql> 
> {code}
> As you can see, Spark automatically casts c1 to int; if the value is out of the 
> integer range, the result is different from Hive's (a hedged illustration follows below).
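
A hedged illustration only, not the coercion-rule change this ticket proposes: comparing 
through an explicit wide-decimal cast sidesteps the int overflow shown above, since 
92233720368547758071 fits in decimal(38,0).

{code}
// Hedged sketch: force a decimal comparison instead of the implicit int cast.
spark.sql("select * from spark_21646 where cast(c1 as decimal(38,0)) > 0").show()
{code}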



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22264:


Assignee: Apache Spark

> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
> Attachments: not-found.png
>
>
> The history server will be unavailable if there is an event log file with a large 
> size ("large" here means the replay time is too long).
> *We can fix this by adding a timeout for event log replaying.*
> *Here is an example:*
> Every application submitted after the restart cannot open the history UI.
> !not-found.png!
> *From the event log directory we can find an event log file whose size is bigger 
> than 130GB:*
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> *and from jstack and server log we can see replaying task blocked on this 
> event log:*
> *server log:*
> {code:java}
> 2017-10-12,16:00:12,151 INFO 
> org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
> Begin to replay 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
> {code}
> *jstack*
> {code:java}
> "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
> runnable [0x7f0f4f6f5000]
>java.lang.Thread.State: RUNNABLE
> at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
> at 
> net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
> at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
> at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.readLine(BufferedReader.java:317)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.BufferedReader.readLine(BufferedReader.java:382)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22264:


Assignee: (was: Apache Spark)

> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: not-found.png
>
>
> The history server will be unavailable if there is an event log file with a large 
> size ("large" here means the replay time is too long).
> *We can fix this by adding a timeout for event log replaying.*
> *Here is an example:*
> Every application submitted after the restart cannot open the history UI.
> !not-found.png!
> *From the event log directory we can find an event log file whose size is bigger 
> than 130GB:*
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> *and from jstack and server log we can see replaying task blocked on this 
> event log:*
> *server log:*
> {code:java}
> 2017-10-12,16:00:12,151 INFO 
> org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
> Begin to replay 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
> {code}
> *jstack*
> {code:java}
> "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
> runnable [0x7f0f4f6f5000]
>java.lang.Thread.State: RUNNABLE
> at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
> at 
> net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
> at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
> at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.readLine(BufferedReader.java:317)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.BufferedReader.readLine(BufferedReader.java:382)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201741#comment-16201741
 ] 

Apache Spark commented on SPARK-22264:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19482

> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: not-found.png
>
>
> The history server will be unavailable if there is an event log file with a large 
> size ("large" here means the replay time is too long).
> *We can fix this by adding a timeout for event log replaying.*
> *Here is an example:*
> Every application submitted after the restart cannot open the history UI.
> !not-found.png!
> *From the event log directory we can find an event log file whose size is bigger 
> than 130GB:*
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> *and from jstack and server log we can see replaying task blocked on this 
> event log:*
> *server log:*
> {code:java}
> 2017-10-12,16:00:12,151 INFO 
> org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
> Begin to replay 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
> {code}
> *jstack*
> {code:java}
> "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
> runnable [0x7f0f4f6f5000]
>java.lang.Thread.State: RUNNABLE
> at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
> at 
> net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
> at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
> at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.readLine(BufferedReader.java:317)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.BufferedReader.readLine(BufferedReader.java:382)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
*We can fix this to add a timeout for event log replaying.*
*Here is an example:*
Every application submitted after restart can not open history ui.
!not-found.png!

*From event log directory we can find an event log file size is bigger than 
130GB.*
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

*and from jstack and server log we can see replaying task blocked on this event 
log:*
*server log:*
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

*jstack*
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
We can fix this to add a timeout for event log replaying.
Here is an example:
Every application submitted after restart can not open history ui.
!not-found.png!

*From event log directory we can find an event log file size is bigger than 
130GB.*
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

*and from jstack and server log we can see replaying task blocked on this event 
log:*
*server log:*
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

*jstack*
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 

[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
We can fix this to add a timeout for event log replaying.
Here is an example:
Every application submitted after restart can not open history ui.
!not-found.png!

*From event log directory we can find an event log file size is bigger than 
130GB.*
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

*and from jstack and server log we can see replaying task blocked on this event 
log:*
*server log:*
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

*jstack*
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
We can fix this to add a timeout for event log replaying.
Here is an example:
Every application submitted after restart can not open history ui.
!not-found.png!

*From event log directory we can find an event log file size is bigger than 
130GB.
*{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

*and from jstack and server log we can see replaying task blocked on this event 
log:
**server log:*
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

*jstack*
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 

[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
We can fix this to add a timeout for event log replaying.
Here is an example:
Every application submitted after restart can not open history ui.
!not-found.png!
Event log file size is bigger than 130GB.
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

and from jstack and server log we can see replaying task blocked on this event 
log:
server log:
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

jstack
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
History server will be unavailable if there is an event log file with large 
size.
Large size here means the replaying time is too long.
We can fix this to add a timeout for event log replaying.
Here is an example:
Every application submitted after restart can not open history ui.

Event log file size is bigger than 130GB.
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

From jstack and the server log we can see the replaying task blocked on this event log:
server log:
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

jstack
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
{code}
[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
Every application submitted after the restart cannot open the history UI.
!not-found.png!

*From the event log directory we can see that the event log file is bigger than 130 GB:*
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

*From jstack and the server log we can see the replaying task blocked on this event log:*
*server log:*
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

*jstack*
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
Every application submitted after the restart cannot open the history UI.
!not-found.png!
The event log file size is bigger than 130 GB:
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

From jstack and the server log we can see the replaying task blocked on this event log:
server log:
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

jstack
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
{code}
[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
Every application submitted after the restart cannot open the history UI.

The event log file size is bigger than 130 GB:
{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}

From jstack and the server log we can see the replaying task blocked on this event log:
server log:
{code:java}
2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: 
Replaying log path: 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
Begin to replay 
hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
{code}

jstack
{code:java}
"log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
runnable [0x7f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x0005f0096948> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
The event log file size is bigger than 130 GB:

{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}




> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: not-found.png
>
>
> The history server will be unavailable if there is an event log file with a large size.
> Large size here means that replaying it takes too long.
> We can fix this by adding a timeout for event log replaying.
> Here is an example:
> Every application submitted after the restart cannot open the history UI.
> The event log file size is bigger than 130 GB:
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> From jstack and the server log we can see the replaying task blocked on this event log:
> server log:
> {code:java}
> 

[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Attachment: not-found.png

> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: not-found.png
>
>
> The history server will be unavailable if there is an event log file with a large size.
> Large size here means that replaying it takes too long.
> We can fix this by adding a timeout for event log replaying.
> Here is an example:
> Every application submitted after the restart cannot open the history UI.
> The event log file size is bigger than 130 GB:
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> From jstack and the server log we can see the replaying task blocked on this event log:
> server log:
> {code:java}
> 2017-10-12,16:00:12,151 INFO 
> org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
> Begin to replay 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
> {code}
> jstack
> {code:java}
> "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
> runnable [0x7f0f4f6f5000]
>java.lang.Thread.State: RUNNABLE
> at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
> at 
> net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
> at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
> at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.readLine(BufferedReader.java:317)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.BufferedReader.readLine(BufferedReader.java:382)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-12 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22264:
-
Description: 
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
The event log file size is bigger than 130 GB:

{code:java}
hadoop *144149840801* 2017-08-29 14:03 
/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
{code}



  was:
The history server will be unavailable if there is an event log file with a large size.
Large size here means that replaying it takes too long.
We can fix this by adding a timeout for event log replaying.
Here is an example:
The event log file size is bigger than 130 GB:



> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> The history server will be unavailable if there is an event log file with a large size.
> Large size here means that replaying it takes too long.
> We can fix this by adding a timeout for event log replaying.
> Here is an example:
> The event log file size is bigger than 130 GB:
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


