[jira] [Created] (SPARK-26533) Support query auto cancel on thriftserver
zhoukang created SPARK-26533: Summary: Support query auto cancel on thriftserver Key: SPARK-26533 URL: https://issues.apache.org/jira/browse/SPARK-26533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: zhoukang

Support automatically cancelling queries that run too long on the Thrift server. In some cases we use the Thrift server as a long-running application, and we sometimes want no query to run longer than a given time. In these cases we can enable auto-cancel for time-consuming queries, which lets us release resources for other queries to run.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
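As a sketch of the requested behaviour, a watchdog that cancels a query once it exceeds a time budget could look like the following (Python for brevity; `run_with_timeout`, `query_fn`, and `cancel_fn` are hypothetical names — the real change would hook into the Thrift server's operation manager):

```python
import threading

def run_with_timeout(query_fn, cancel_fn, timeout_s):
    """Run query_fn; if it has not finished within timeout_s seconds,
    invoke cancel_fn. Returns (result, was_cancelled)."""
    done = threading.Event()
    cancelled = []

    def watchdog():
        # Wait until the query signals completion or the budget expires.
        if not done.wait(timeout_s):
            cancelled.append(True)
            cancel_fn()  # ask the query to stop (cooperative cancel)

    w = threading.Thread(target=watchdog)
    w.start()
    try:
        result = query_fn()
    finally:
        done.set()
        w.join()
    return result, bool(cancelled)
```

A query that finishes within the budget returns normally; a long-running one gets its cancel callback invoked, after which the Thrift server could release the query's resources.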
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733848#comment-16733848 ] Hyukjin Kwon commented on SPARK-26126: -- There are sometimes reasons behind this, and it takes a lot of effort to check the history of old code in particular. For instance, some modules may intentionally not need {{spark-tags}} and use module-specific annotations instead. If you can track the history, check why it was added there, and can explain why changing it is correct, please go ahead with a PR. Don't forget to double-check the SBT build as well. Otherwise, let's just leave this resolved.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Commented] (SPARK-26339) Behavior of reading files that start with underscore is confusing
[ https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733844#comment-16733844 ] Apache Spark commented on SPARK-26339: -- User 'KeiichiHirobe' has created a pull request for this issue: https://github.com/apache/spark/pull/23446
> Behavior of reading files that start with underscore is confusing
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Keiichi Hirobe
> Assignee: Keiichi Hirobe
> Priority: Minor
>
> The behavior of reading files that start with an underscore is as follows.
> # spark.read (no schema) throws an exception whose message is confusing.
> # spark.read (user-specified schema) successfully reads, but the content is empty.
> Example files are as follows.
> The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-----+---+
> |  _c0|_c1|
> +-----+---+
> |test1| 10|
> |test2| 20|
> +-----+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-----+------+
> | test|number|
> +-----+------+
> |test1|    10|
> |test2|    20|
> +-----+------+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
> at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
> at scala.Option.getOrElse(Option.scala:138)
> at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
> ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +----+------+
> |test|number|
> +----+------+
> +----+------+
> {code}
> I noticed that Spark cannot read files that start with an underscore after I
> read some of the code. (I could not find any documentation about this file-name limitation.)
> The behavior above is not good, especially in the user-specified-schema case, I think.
> I suggest throwing an exception whose message is "Path does not exist" in both
> cases.
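For context, Spark's file listing (via Hadoop's default path filter) skips names starting with `_` or `.`, which is why `_test.csv` is silently invisible. The sketch below mimics that filter plus the reporter's suggested fail-fast check (Python; `resolve_input_paths` and its error message are hypothetical, not Spark's actual code):

```python
import os

def is_visible(path):
    """Mirror of Hadoop's default hidden-file filter: names starting
    with '_' or '.' are skipped during listing (sketch, not Spark's code)."""
    name = os.path.basename(path.rstrip("/"))
    return not name.startswith(("_", "."))

def resolve_input_paths(paths):
    """Hypothetical helper implementing the suggestion: fail fast with a
    clear 'Path does not exist'-style error instead of silently
    returning an empty result when every matched file is hidden."""
    visible = [p for p in paths if is_visible(p)]
    if not visible:
        raise FileNotFoundError(
            "Path does not exist (all matched files are hidden): %s" % paths)
    return visible
```

With such a check, both the no-schema and user-specified-schema cases would fail with the same clear message instead of diverging.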
[jira] [Comment Edited] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733836#comment-16733836 ] liupengcheng edited comment on SPARK-26126 at 1/4/19 5:41 AM: -- [~hyukjin.kwon] Yes, there is no actual problem happening, but putting the scala-library dependency into the spark-tags module is confusing. If you agree that we should put it into the root pom for better understanding, I can open a PR for this issue, or you can just leave it resolved.
was (Author: liupengcheng): [~hyukjin.kwon] Yes, there is no actual problem happening, but put the dependency of scala-library into spark-tags module is confusing. If you agree that we shall put it into root pom for better understanding, I can put a PR for this issue.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733836#comment-16733836 ] liupengcheng commented on SPARK-26126: -- [~hyukjin.kwon] Yes, there is no actual problem happening, but putting the scala-library dependency into the spark-tags module is confusing. If you agree that we should put it into the root pom for better understanding, I can open a PR for this issue.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Updated] (SPARK-26532) repartitionByRange reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Component/s: SQL
> repartitionByRange reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Updated] (SPARK-26532) repartitionByRange reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Summary: repartitionByRange reads source files twice (was: repartitionByRange strategy reads source files twice)
> repartitionByRange reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Created] (SPARK-26532) repartitionByRange strategy reads source files twice
Mike Dias created SPARK-26532: - Summary: repartitionByRange strategy reads source files twice Key: SPARK-26532 URL: https://issues.apache.org/jira/browse/SPARK-26532 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0, 2.3.2 Reporter: Mike Dias Attachments: repartition Stages.png, repartitionByRange Stages.png

When using repartitionByRange in the Structured Streaming API to read and then write files, the source files are read twice.

Example:
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load
val q = ds.
  repartitionByRange(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
This execution creates 3 stages: 2 for reading and 1 for writing, reading the source twice. This is easy to see on a large dataset, where the reading time is doubled.
{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications//stages
{code}
This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load
val q = ds.
  repartition(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
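The extra stage exists because range partitioning must first sample the data to choose partition boundaries (in Spark, `RangePartitioner` runs a separate sampling job over the source), while hash-based `repartition` needs no such pass. A minimal Python sketch of the two-pass idea (all names here are hypothetical, not Spark's implementation):

```python
import bisect
import random

def range_partition(records, key, num_partitions, sample_size=100, seed=7):
    """Sketch of why repartitionByRange costs an extra pass over the
    source: pass 1 samples the data to pick sorted partition boundaries,
    pass 2 assigns each record to a partition by binary search."""
    # Pass 1: sample the data and derive (num_partitions - 1) boundaries.
    sample = sorted(key(r) for r in random.Random(seed).sample(
        records, min(sample_size, len(records))))
    step = max(1, len(sample) // num_partitions)
    bounds = sample[step::step][: num_partitions - 1]
    # Pass 2: route every record to its range partition.
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[bisect.bisect_left(bounds, key(r))].append(r)
    return parts
```

A hash repartition, by contrast, can route each record in the same single pass that reads it, which is why it shows only one read stage.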
[jira] [Updated] (SPARK-26532) repartitionByRange strategy reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Attachment: repartitionByRange Stages.png repartition Stages.png
> repartitionByRange strategy reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Assigned] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26530: Assignee: Apache Spark
> Validate heartbeat arguments in SparkSubmitArguments
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Assignee: Apache Spark
> Priority: Major
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these
> args are improperly specified, the application may run for a while and not
> fail until the max number of executor failures is reached (especially with
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should validate them before submitting to the cluster.
[jira] [Assigned] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26530: Assignee: (was: Apache Spark)
> Validate heartbeat arguments in SparkSubmitArguments
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Priority: Major
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these
> args are improperly specified, the application may run for a while and not
> fail until the max number of executor failures is reached (especially with
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should validate them before submitting to the cluster.
[jira] [Created] (SPARK-26531) How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?
Lê Văn Thanh created SPARK-26531: Summary: How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode? Key: SPARK-26531 URL: https://issues.apache.org/jira/browse/SPARK-26531 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.2.0 Environment: Ubuntu 16.04 Hive 2.3.0 Spark 2.0.0 Reporter: Lê Văn Thanh

Hello, I have a table with 10 GB of data and 4 GB of free RAM. I tried to select data from this table into another table in ORC format, but I got an error about the memory limit (I am running the SELECT query in the console). How can I set up Apache Spark to use the local hard disk when the data does not fit in RAM in local mode?
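For what it's worth, Spark already spills shuffle and sort data to disk automatically, and cached data can be told to fall back to disk with `persist(StorageLevel.MEMORY_AND_DISK)`; `spark.local.dir` controls where spills land. An illustrative (not prescriptive) config sketch, with placeholder paths and values:

```properties
# spark-defaults.conf — illustrative values, not from the original question
spark.local.dir        /path/to/large/disk/spark-tmp   # where spill files are written
spark.memory.fraction  0.6                             # default; leave headroom for execution
```

Whether this resolves the reporter's specific ORC write error would still need investigation of the actual failure message.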
[jira] [Created] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
liupengcheng created SPARK-26530: Summary: Validate heartbeat arguments in SparkSubmitArguments Key: SPARK-26530 URL: https://issues.apache.org/jira/browse/SPARK-26530 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: liupengcheng

Currently, heartbeat-related arguments are not validated in Spark, so if these args are improperly specified, the application may run for a while and not fail until the max number of executor failures is reached (especially with spark.dynamicAllocation.enabled=true), which may waste resources. We should validate them before submitting to the cluster.
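A sketch of the kind of submit-time check being proposed (Python; the config keys are real Spark keys, but `validate_heartbeat_args` and the exact rule are assumptions): the executor heartbeat interval should be positive and shorter than the network timeout, otherwise every heartbeat looks like a timeout.

```python
def parse_time_ms(s):
    """Tiny duration parser for the sketch ('10s', '120s', '500ms')."""
    s = s.strip().lower()
    if s.endswith("ms"):
        return int(s[:-2])
    if s.endswith("s"):
        return int(s[:-1]) * 1000
    return int(s)

def validate_heartbeat_args(conf):
    """Hypothetical submit-time validation: fail fast instead of letting
    the application limp along until max executor failures is reached."""
    heartbeat_ms = parse_time_ms(conf.get("spark.executor.heartbeatInterval", "10s"))
    timeout_ms = parse_time_ms(conf.get("spark.network.timeout", "120s"))
    if heartbeat_ms <= 0:
        raise ValueError("spark.executor.heartbeatInterval must be positive")
    if heartbeat_ms >= timeout_ms:
        raise ValueError(
            "spark.executor.heartbeatInterval (%dms) must be less than "
            "spark.network.timeout (%dms)" % (heartbeat_ms, timeout_ms))
```

Run against the user-supplied configuration before the application is submitted to the cluster, this turns a slow, resource-wasting failure into an immediate error message.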
[jira] [Comment Edited] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733815#comment-16733815 ] Fengyu Cao edited comment on SPARK-26389 at 1/4/19 4:03 AM: -- {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there are no logs or error messages saying that we *must* set a non-temp checkpoint if we run a framework non-locally. And if we do this (run non-locally with a temp checkpoint), the checkpoint dir on the executor consumes lots of space and is not deleted if the query fails, and this checkpoint can't be used for recovery, as I mentioned above. I just think that Spark should either prohibit users from using temp checkpoints when their frameworks are non-local, or be responsible for cleaning up this useless checkpoint directory even if the query fails.
was (Author: camper42): {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there're no logs or error msgs says that we *must* set a non-temp checkpoint if we run a framework non-local And if we do this(run non-local with temp checkpoint), the checkpoint dir on executor consume lots of space and not be deleted if the query if fail, and this checkpoint can't be used to recover as I mentioned above. I just think that spark either should prohibits users from using temp checkpoints when their frameworks are non-local, or should be responsible for cleaning up this useless checkpoint directory even if the query fails.
> temp checkpoint folder at executor should be deleted on graceful shutdown
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Fengyu Cao
> Priority: Major
>
> {{spark-submit --master mesos:// -conf spark.streaming.stopGracefullyOnShutdown=true framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on executor not deleted due to org.apache.spark.SparkException: Writing job aborted., and this temp checkpoint can't be used for recovery.}}
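A minimal sketch of the requested cleanup (Python; `create_temp_checkpoint_dir` is a hypothetical helper, not Spark's actual code path): create the temp checkpoint directory and register its removal at shutdown, so it disappears even when the query aborted.

```python
import atexit
import shutil
import tempfile

def create_temp_checkpoint_dir():
    """Sketch: when no checkpointLocation is configured, create a temp
    dir and register cleanup so it is removed at process shutdown even
    if the query failed, instead of leaking space on the executor."""
    path = tempfile.mkdtemp(prefix="temporary-")
    # Remove the dir on interpreter exit; ignore errors so a failed
    # query never blocks shutdown.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path
```

The alternative raised in the comment — refusing a temp checkpoint for non-local deployments — would instead be a validation at query start; either way the temp directory stops accumulating on executors.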
[jira] [Commented] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733815#comment-16733815 ] Fengyu Cao commented on SPARK-26389: {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there are no logs or error messages saying that we *must* set a non-temp checkpoint if we run a framework non-locally. And if we do this (run non-locally with a temp checkpoint), the checkpoint dir on the executor consumes lots of space and is not deleted if the query fails, and this checkpoint can't be used for recovery, as I mentioned above. I just think that Spark should either prohibit users from using temp checkpoints when their frameworks are non-local, or be responsible for cleaning up this useless checkpoint directory even if the query fails.
> temp checkpoint folder at executor should be deleted on graceful shutdown
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Fengyu Cao
> Priority: Major
>
> {{spark-submit --master mesos:// -conf spark.streaming.stopGracefullyOnShutdown=true framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on executor not deleted due to org.apache.spark.SparkException: Writing job aborted., and this temp checkpoint can't be used for recovery.}}
[jira] [Assigned] (SPARK-26529) Add logs for IOException when preparing local resource
[ https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26529: Assignee: Apache Spark
> Add logs for IOException when preparing local resource
>
> Key: SPARK-26529
> URL: https://issues.apache.org/jira/browse/SPARK-26529
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Assignee: Apache Spark
> Priority: Major
>
> Currently, {{Client#createConfArchive}} does not handle IOException, and no
> detailed info is provided in the logs. Sometimes this may delay
> locating the root cause of an I/O error.
> A case that happened in our production environment is that a local disk was full, and
> the following exception was thrown, but no path info was provided. We had
> to investigate all the local disks of the machine to find the root cause.
> {code:java}
> Exception in thread "main" java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360)
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657)
> at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> It makes sense for us to catch the IOException and print some useful
> information.
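A sketch of the proposed fix (Python stand-in for the Scala `createConfArchive`; the helper name and message format are assumptions): catch the low-level error and re-raise it with the failing paths attached, so "No space left on device" points at a concrete location.

```python
import os
import zipfile

def create_conf_archive(conf_dir, archive_path):
    """Zip the config directory; on failure, wrap the bare OSError
    (e.g. 'No space left on device') with the paths involved so the
    log identifies the disk at fault. Sketch of the proposed logging,
    not Spark's actual Client#createConfArchive."""
    try:
        with zipfile.ZipFile(archive_path, "w") as zf:
            for name in os.listdir(conf_dir):
                zf.write(os.path.join(conf_dir, name), arcname=name)
    except OSError as e:
        raise OSError(
            "Failed to create conf archive at %r from %r: %s"
            % (archive_path, conf_dir, e)) from e
    return archive_path
```

With the wrapped exception, the operator sees which path the archive was being written to instead of having to inspect every local disk on the machine.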
[jira] [Assigned] (SPARK-26529) Add logs for IOException when preparing local resource
[ https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26529: Assignee: (was: Apache Spark) > Add logs for IOException when preparing local resource > --- > > Key: SPARK-26529 > URL: https://issues.apache.org/jira/browse/SPARK-26529 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: liupengcheng >Priority: Major > > Currently, `Client#createConfArchive` do not handle IOException, and some > detail info is not provided in logs. Sometimes, this may delay the time of > locating the root cause of io error. > A case happened in our production environment is that local disk is full, and > the following exception is thrown but no detail path info provided. we have > to investigate all the local disk of the machine to find out the root cause. > {code:java} > Exception in thread "main" java.io.IOException: No space left on device > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:345) > at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) > at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238) > at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343) > at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238) > at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360) > at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769) > at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895) > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > It makes sense to catch the IOException and log some useful > information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733799#comment-16733799 ] Hyukjin Kwon edited comment on SPARK-26075 at 1/4/19 3:27 AM: -- [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? Let me take a quick look when I'm available. was (Author: hyukjin.kwon): [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use a broadcast join but I get the error below in Spark 2.3, > while the same code works fine in Spark 2.2. > > Upon checking, the DataFrames are merely 50 MB and I have set the > threshold to 200 MB as well. As mentioned above, the same code works fine > in Spark 2.2. > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, disabling broadcasting works fine: > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}}
[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733799#comment-16733799 ] Hyukjin Kwon commented on SPARK-26075: -- [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use a broadcast join but I get the error below in Spark 2.3, > while the same code works fine in Spark 2.2. > > Upon checking, the DataFrames are merely 50 MB and I have set the > threshold to 200 MB as well. As mentioned above, the same code works fine > in Spark 2.2. > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, disabling broadcasting works fine: > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}}
[jira] [Created] (SPARK-26529) Add logs for IOException when preparing local resource
liupengcheng created SPARK-26529: Summary: Add logs for IOException when preparing local resource Key: SPARK-26529 URL: https://issues.apache.org/jira/browse/SPARK-26529 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: liupengcheng Currently, `Client#createConfArchive` does not handle IOException, and no detailed info is provided in the logs. This can delay locating the root cause of an I/O error. A case from our production environment: the local disk was full, and the following exception was thrown without any path info, so we had to inspect every local disk on the machine to find the root cause. {code:java} Exception in thread "main" java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:345) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238) at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343) at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238) at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360) at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177) at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} It makes sense to catch the IOException and log some useful information.
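A minimal sketch of the proposed fix, in plain Java. `writeArchive` is a hypothetical stand-in for the zip-writing step of `Client#createConfArchive` (not Spark's actual code); the only change it illustrates is catching the IOException and rethrowing it with the target path, so a "No space left on device" failure immediately names the file involved.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ConfArchiveSketch {
    // Hypothetical stand-in for Client#createConfArchive's zip-writing step.
    // Proposed change: wrap the write and rethrow the IOException with the
    // target path in the message instead of letting it propagate bare.
    static void writeArchive(File target) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(target))) {
            zip.putNextEntry(new ZipEntry("__spark_conf__.properties"));
            zip.write("demo=true".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IOException(
                "Failed to write Spark conf archive to " + target.getAbsolutePath(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("spark-conf", ".zip");
        writeArchive(target);
        System.out.println("wrote " + target.length() + " bytes to " + target);
        target.delete();
    }
}
```

With this wrapping, the "No space left on device" stack trace above would be prefixed with the archive's path, avoiding the disk-by-disk hunt the reporter describes.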
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733796#comment-16733796 ] Hyukjin Kwon commented on SPARK-26126: -- If there's no actual problem happening, I would leave this resolved. > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > While doing a backport in our custom Spark, I noticed some strange code in > the spark-tags module: > {code:java} > <dependencies> > <dependency> > <groupId>org.scala-lang</groupId> > <artifactId>scala-library</artifactId> > <version>${scala.version}</version> > </dependency> > </dependencies> > {code} > As I understand it, shouldn't spark-tags contain only annotation-related > classes and deps? > Should we move the scala-library dependency to the root pom?
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733795#comment-16733795 ] Hyukjin Kwon commented on SPARK-26126: -- It matters because JIRA is meant for filing issues; questions should go to the mailing list. What problem does it cause? > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > While doing a backport in our custom Spark, I noticed some strange code in > the spark-tags module: > {code:java} > <dependencies> > <dependency> > <groupId>org.scala-lang</groupId> > <artifactId>scala-library</artifactId> > <version>${scala.version}</version> > </dependency> > </dependencies> > {code} > As I understand it, shouldn't spark-tags contain only annotation-related > classes and deps? > Should we move the scala-library dependency to the root pom?
[jira] [Updated] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshanxiao updated SPARK-26528: --- Priority: Minor (was: Major) > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Priority: Minor > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at > org.apache.spark.deploy.history.FsHisto
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26526: --- Assignee: Liu, Linhong > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > > The test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is meant > to ensure a non-deterministic `Filter` won't be pushed through a `Project`, > but in the current code base it no longer covers that purpose. > Changing LogicalRDD to HadoopFsRelation fixes this.
[jira] [Resolved] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26526. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23440 [https://github.com/apache/spark/pull/23440] > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > The test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is meant > to ensure a non-deterministic `Filter` won't be pushed through a `Project`, > but in the current code base it no longer covers that purpose. > Changing LogicalRDD to HadoopFsRelation fixes this.
[jira] [Assigned] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26528: Assignee: Apache Spark > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Assignee: Apache Spark >Priority: Major > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at >
[jira] [Assigned] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26528: Assignee: (was: Apache Spark) > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Priority: Major > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at > org.apache.spark.deploy.
[jira] [Created] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
deshanxiao created SPARK-26528: -- Summary: FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property Key: SPARK-26528 URL: https://issues.apache.org/jira/browse/SPARK-26528 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: deshanxiao Running FsHistoryProviderSuite in IDEA fails because the property "spark.testing" does not exist. In this situation, the replay executor may replay a file twice. {code:java} private val replayExecutor: ExecutorService = { if (!conf.contains("spark.testing")) { ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, "log-replay-executor") } else { MoreExecutors.sameThreadExecutor() } } {code} {code:java} "SPARK-3697: ignore files that cannot be read." 2 was not equal to 1 ScalaTestFailureLocation: org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at (FsHistoryProviderSuite.scala:179) Expected :1 Actual :2 org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) at org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) at org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:381) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at 
org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) at org.apache.spark.deploy.history.FsHistoryProviderSuite.run(FsHistoryProviderSuite.scala:51) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1334) at scala.collection.immut
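The failure is environmental: Spark's SBT/Maven test runs are expected to pass `-Dspark.testing=1` to the test JVM, while a stock IDEA run configuration does not, so the suite takes the multi-threaded replay path. A plain-Java sketch of the selection logic quoted above (the system-property check is an assumed approximation of `conf.contains("spark.testing")`, not Spark's actual code):

```java
public class ReplayExecutorChoice {
    // Mirrors the quoted FsHistoryProvider logic: a pooled executor normally,
    // a same-thread executor when "spark.testing" is set, so tests see
    // deterministic single-threaded replay. An IDEA run configuration needs
    // -Dspark.testing=1 in its VM options to take the test path.
    static String replayMode() {
        return System.getProperty("spark.testing") == null ? "pooled" : "same-thread";
    }

    public static void main(String[] args) {
        System.out.println("replay mode: " + replayMode());
    }
}
```

Adding the VM option to the IDE run configuration, rather than changing the provider, is the least invasive workaround for the reported double-replay.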
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26527: - Assignee: SongYadong > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, fail fast if the > required space exceeds the memory limit, just like when acquiring storage memory. > This may avoid some computation and memory-eviction cost, especially > when the required space ({{numBytes}}) is very large.
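A plain-Java sketch of the proposed fail-fast behavior (not the actual `StaticMemoryManager` code; the pool and method names here are illustrative): before doing any eviction work, compare the request against the pool's hard limit and return immediately when it can never be satisfied.

```java
public class UnrollMemoryPoolSketch {
    private final long maxMemory; // hard limit of the pool, in bytes
    private long used;

    UnrollMemoryPoolSketch(long maxMemory) {
        this.maxMemory = maxMemory;
    }

    // Sketch of the proposed change: if numBytes can never fit even after
    // evicting every cached block, fail immediately instead of first
    // computing eviction candidates.
    boolean acquireUnrollMemory(long numBytes) {
        if (numBytes > maxMemory) {
            return false; // fail fast: request exceeds the pool's hard limit
        }
        if (numBytes > maxMemory - used) {
            return false; // real code would try evicting blocks before giving up
        }
        used += numBytes;
        return true;
    }

    public static void main(String[] args) {
        UnrollMemoryPoolSketch pool = new UnrollMemoryPoolSketch(100);
        System.out.println(pool.acquireUnrollMemory(150)); // over the hard limit
        System.out.println(pool.acquireUnrollMemory(60));  // fits
    }
}
```

The first branch is the improvement: a very large `numBytes` is rejected without touching the eviction machinery at all.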
[jira] [Reopened] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 秦坤 reopened SPARK-26494: > 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type > can't be found, > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: 秦坤 >Priority: Minor > > Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails > because the type cannot be resolved. > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, > the sqlType value seen by the function getCatalystType in the > JdbcUtils class is -102.
[jira] [Updated] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 秦坤 updated SPARK-26494: --- Description: Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails because the type cannot be resolved. When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, the sqlType value seen by the function getCatalystType in the JdbcUtils class is -102. was: (Chinese original, translated:) Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type, the type cannot be found; when the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, the sqlType value of getCatalystType in JdbcUtils is -102. Summary: 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found, (was: 【spark sql】 Using Spark to read Oracle TIMESTAMP(6) WITH LOCAL TIME ZONE, the type cannot be found) > 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type > can't be found, > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: 秦坤 >Priority: Minor > > Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails > because the type cannot be resolved. > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, > the sqlType value seen by the function getCatalystType in the > JdbcUtils class is -102.
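For context, -102 is Oracle's vendor-specific JDBC type code for TIMESTAMP WITH LOCAL TIME ZONE (`oracle.jdbc.OracleTypes.TIMESTAMPLTZ`); it is not defined in `java.sql.Types`, so a generic mapper never matches it. The sketch below is a plain function, not Spark's actual `JdbcDialects` API, showing the kind of dialect-level mapping a fix could add:

```java
import java.sql.Types;
import java.util.Optional;

public class OracleTypeMappingSketch {
    // Oracle's vendor code for TIMESTAMP WITH LOCAL TIME ZONE
    // (oracle.jdbc.OracleTypes.TIMESTAMPLTZ), outside java.sql.Types.
    static final int ORACLE_TIMESTAMP_LTZ = -102;

    // Illustrative dialect-level mapping: translate both the standard and the
    // Oracle-specific timestamp codes to a Catalyst timestamp; return empty to
    // fall back to the generic mapping for everything else.
    static Optional<String> toCatalystType(int sqlType) {
        switch (sqlType) {
            case Types.TIMESTAMP:
            case ORACLE_TIMESTAMP_LTZ:
                return Optional.of("TimestampType");
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toCatalystType(-102).orElse("unmapped"));
    }
}
```

In Spark itself this mapping would live in an Oracle JDBC dialect's `getCatalystType` override rather than in a standalone function; the switch above only illustrates the missing case.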
[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
[ https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733740#comment-16733740 ] Saisai Shao commented on SPARK-26512: - Please list the problems you saw, along with any logs or exceptions. We can't tell anything from the information above. > Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10? > --- > > Key: SPARK-26512 > URL: https://issues.apache.org/jira/browse/SPARK-26512 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, YARN >Affects Versions: 2.4.0 > Environment: operating system : Windows 10 > Spark Version : 2.4.0 > Hadoop Version : 2.8.3 >Reporter: Anubhav Jain >Priority: Minor > Labels: windows > Attachments: log.png > > > I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is > working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN > as the cluster manager, it does not work. When I try to submit a Spark job > using spark-submit for testing, it appears under the ACCEPTED tab in the YARN > UI and after that it fails -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733738#comment-16733738 ] liupengcheng commented on SPARK-26126: -- [~hyukjin.kwon] it's a minor issue; it doesn't really matter functionally, but it is confusing. > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > When doing some backports in our custom Spark, I noticed some strange code in > the spark-tags module:
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As I understand it, shouldn't spark-tags contain only annotation-related > classes and dependencies? > Should we move the scala-library dependency to the root pom? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/SPARK-26452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733715#comment-16733715 ] Jungtaek Lim commented on SPARK-26452: -- Maybe it would be better to force the driver to shut down if an uncaught exception occurs in the DAG event loop? I guess that loop is one of the driver's main roles, so the driver appears to be malfunctioning to end users once the DAG event thread is killed. > Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: > Java heap space > - > > Key: SPARK-26452 > URL: https://issues.apache.org/jira/browse/SPARK-26452 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: tommy duan >Priority: Major > > In a Spark 2.2.0 Structured Streaming program, the submit command was as > follows: > {code:java} > spark-submit.sh \ > --driver-memory 3g \ > --num-executors 10 \ > --executor-memory 16g \ > {code} > After running for a day the application entered a "fake death" state: the > executors show running status and the driver still exists, but no tasks are > assigned. The error message printed by the program is as follows: > {code:java} > [Stage 1852:===>(896 + 3) / > 900] > [Stage 1852:===>(897 + 3) / > 900] > [Stage 1852:===>(899 + 1) / > 900] > [Stage 1853:> (0 + 0) / 900] > 18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java > heap space java.lang.OutOfMemoryError: Java heap space > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at > 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) > at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) > at > org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonf
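Jungtaek Lim's suggestion above (forcing the driver down when the DAG event loop dies, rather than limping on in a "fake death" state) can be sketched with a thread-level uncaught-exception hook. This is plain Python illustrating the mechanism, not Spark's Scala EventLoop; all names here are illustrative:

```python
import threading

# Records which critical threads died and why; a real driver would instead
# trigger a forced shutdown (e.g. a shutdown hook or os._exit) at this point.
failed = []

def on_uncaught(args):
    # args.thread is the thread that died, args.exc_value the exception.
    failed.append((args.thread.name, type(args.exc_value).__name__))

threading.excepthook = on_uncaught

def event_loop():
    # Stand-in for the DAG scheduler event loop dying with an OOM.
    raise MemoryError("simulated: Java heap space")

t = threading.Thread(target=event_loop, name="dag-scheduler-event-loop")
t.start()
t.join()
```

The point of the design is that the hook fires for any uncaught exception on the event-loop thread, so the process can fail loudly instead of leaving a half-alive driver behind.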
[jira] [Created] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
SongYadong created SPARK-26527: -- Summary: Let acquireUnrollMemory fail fast if required space exceeds memory limit Key: SPARK-26527 URL: https://issues.apache.org/jira/browse/SPARK-26527 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: SongYadong When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast if the required space exceeds the memory limit, just like acquiring storage memory. I think this may reduce some computation and memory-eviction cost, especially when the required space ({{numBytes}}) is very large. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
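The fail-fast idea amounts to checking the request against the total pool before doing any eviction work. A simplified, hypothetical model of that logic in plain Python; the pool model and names are illustrative, not the actual StaticMemoryManager:

```python
def acquire_unroll_memory(num_bytes, max_memory, used_memory, evictable=0):
    """Toy model of unroll-memory acquisition with a fail-fast guard.

    `evictable` stands in for the bytes that could be freed by evicting
    cached blocks; all names and the pool model are illustrative only.
    """
    # Fail fast: if the request alone exceeds the whole pool, no amount
    # of eviction can satisfy it, so skip the eviction bookkeeping entirely.
    if num_bytes > max_memory:
        return False
    free = max_memory - used_memory
    if num_bytes <= free:
        return True
    # Only now pay the cost of evicting cached blocks to make room.
    return num_bytes <= free + evictable
```

The saving is the first branch: for a huge {{numBytes}} the manager returns immediately instead of walking and evicting cached blocks that could never free enough space anyway.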
[jira] [Commented] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733708#comment-16733708 ] Apache Spark commented on SPARK-26527: -- User 'SongYadong' has created a pull request for this issue: https://github.com/apache/spark/pull/23426 > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26527: Assignee: (was: Apache Spark) > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26527: Assignee: Apache Spark > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: Apache Spark >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26489) Use ConfigEntry for hardcoded configs for python/r categories.
[ https://issues.apache.org/jira/browse/SPARK-26489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26489: -- Assignee: Jungtaek Lim > Use ConfigEntry for hardcoded configs for python/r categories. > -- > > Key: SPARK-26489 > URL: https://issues.apache.org/jira/browse/SPARK-26489 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Jungtaek Lim >Priority: Major > > Make the following hardcoded configs to use ConfigEntry. > {code} > spark.python > spark.r > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26489) Use ConfigEntry for hardcoded configs for python/r categories.
[ https://issues.apache.org/jira/browse/SPARK-26489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26489. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23428 [https://github.com/apache/spark/pull/23428] > Use ConfigEntry for hardcoded configs for python/r categories. > -- > > Key: SPARK-26489 > URL: https://issues.apache.org/jira/browse/SPARK-26489 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Make the following hardcoded configs to use ConfigEntry. > {code} > spark.python > spark.r > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733558#comment-16733558 ] Pablo Langa Blanco commented on SPARK-25873: Hello, this looks like a duplicate of SPARK-25919, which has been resolved. Could this one be closed too? Thank you. Regards > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. 
Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
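The mechanism behind this kind of corruption is that a stored instant gets rendered with the session's local timezone, so the same underlying value shows different wall-clock times on the EST Hive host and the west-coast Spark host; if one host's rendering is written back as a new timestamp, the values drift. A minimal, Spark-free illustration of that mechanism in plain Python:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One absolute instant, standing in for what a TIMESTAMP cell stores.
instant = datetime(2019, 1, 3, 12, 0, tzinfo=timezone.utc)

# The same instant rendered on a host in the Hive cluster's zone versus a
# host in the Spark shell's zone:
ny_wall = instant.astimezone(ZoneInfo("America/New_York"))      # EST host: 07:00
la_wall = instant.astimezone(ZoneInfo("America/Los_Angeles"))   # West-coast host: 04:00

# Same instant, different wall clocks; re-parsing one host's wall-clock
# string as a fresh timestamp is exactly how the dob values get shifted.
```

This is why the usual remedy is pinning a session timezone (or keeping timestamps as instants end to end) rather than letting each host render with its own local zone.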
[jira] [Issue Comment Deleted] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-25873: --- Comment: was deleted (was: Hi [~pawanlawale] Timestamp in hive is saved as a long value without timezone associated to this value. It has no sense that spark access to the timezone of the remote cluster because different timestamps could be generated from different timezone (different from hive cluster timezone) The best option is the application manage it’s own timezone associated to each timestamp if its requiered. ¿don’t you think? Regards Pablo) > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. 
Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733538#comment-16733538 ] Pablo Langa Blanco commented on SPARK-25873: Hi [~pawanlawale] A timestamp in Hive is saved as a long value with no timezone associated with it. It makes no sense for Spark to access the timezone of the remote cluster, because timestamps could be generated from different timezones (different from the Hive cluster's timezone). The best option is for the application to manage its own timezone associated with each timestamp, if required. Don't you think? Regards Pablo > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. 
Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure py4j gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26349: Assignee: Apache Spark > Pyspark should not accept insecure py4j gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Blocker > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure py4j gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26349: Assignee: (was: Apache Spark) > Pyspark should not accept insecure py4j gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Blocker > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
[ https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733445#comment-16733445 ] suman gorantla commented on SPARK-19261: Please understand that this command runs successfully in Hive/Impala but not in Spark when using spark.sql. Please let me know why commands supported by Hive are not supported by Spark SQL. > Support `ALTER TABLE table_name ADD COLUMNS(..)` statement > -- > > Key: SPARK-19261 > URL: https://issues.apache.org/jira/browse/SPARK-19261 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: StanZhai >Assignee: Xin Wu >Priority: Major > Fix For: 2.2.0 > > > We should support `ALTER TABLE table_name ADD COLUMNS(..)` statement, which > already be used in version < 2.x. > This is very useful for those who want to upgrade there Spark version to 2.x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
[ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733431#comment-16733431 ] David Vrba commented on SPARK-25401: [~gwang3] could you please recommend someone who would take a look at this PR? Thank you > Reorder the required ordering to match the table's output ordering for bucket > join > -- > > Key: SPARK-25401 > URL: https://issues.apache.org/jira/browse/SPARK-25401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > Currently, we check if SortExec is needed between a operator and its child > operator in method orderingSatisfies, and method orderingSatisfies require > the order in the SortOrders are all the same. > While, take the following case into consideration. > * Table a is bucketed by (a1, a2), sorted by (a2, a1), and buckets number is > 200. > * Table b is bucketed by (b1, b2), sorted by (b2, b1), and buckets number is > 200. > * Table a join table b on (a1=b1, a2=b2) > In this case, if the join is sort merge join, the query planner won't add > exchange on both sides, while, sort will be added on both sides. Actually, > sort is also unnecessary, since in the same bucket, like bucket 1 of table a, > and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
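The equivalence the report describes (within one bucket, data sorted by (a2, a1) is as usable as data sorted by (a1, a2) when both columns are equi-join keys) can be modeled as a relaxed ordering check. A simplified, hypothetical sketch in plain Python; Spark's real orderingSatisfies compares SortOrder expressions, and the names below are illustrative:

```python
def ordering_satisfies(child_order, required_order, equi_join_keys):
    """Return True if `child_order` can serve in place of `required_order`.

    Toy model: orderings are lists of column names, `equi_join_keys` is the
    set of join keys shared by both sides of the sort-merge join.
    """
    prefix = child_order[:len(required_order)]
    # Strict check (current behavior): the required ordering must be an
    # exact prefix of the child's output ordering.
    if prefix == required_order:
        return True
    # Proposed relaxation: if every required column is an equi-join key,
    # any permutation of those keys sorts each bucket equivalently, so a
    # prefix containing exactly the same columns also satisfies it.
    return (set(required_order) <= set(equi_join_keys)
            and set(prefix) == set(required_order))
```

Under this relaxed check, the (a2, a1)-sorted buckets in the example would satisfy a required (a1, a2) ordering, so the planner could skip the redundant SortExec on both sides.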
[jira] [Updated] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-25253: - Fix Version/s: 2.3.3 > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733428#comment-16733428 ] Dongjoon Hyun commented on SPARK-25253: --- Yep. Thanks, [~irashid]. :) > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun edited comment on SPARK-25253 at 1/3/19 7:33 PM: --- -I don't see this in `branch-2.2`?- Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. was (Author: dongjoon): ~I don't see this in `branch-2.2`?~ Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun edited comment on SPARK-25253 at 1/3/19 7:32 PM: --- ~I don't see this in `branch-2.2`?~ Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. was (Author: dongjoon): I don't see this in `branch-2.2`? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733427#comment-16733427 ] Imran Rashid commented on SPARK-25253: -- in branch-2.2: https://github.com/apache/spark/commit/fc1c4e7d24f7d0afb3b79d66aa9812e7dddc2f38 in branch-2.3: https://github.com/apache/spark/commit/a2a54a5f49364a1825932c9f04eb0ff82dd7d465 > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun commented on SPARK-25253: --- I don't see this in `branch-2.2`? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733419#comment-16733419 ] Imran Rashid commented on SPARK-25253: -- [~hyukjin.kwon] yes, it looks like I did as part of other changes ... sorry I forgot to update this jira > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
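The refactor discussed in SPARK-25253 centers on local-socket connection logic with consistent IPv6 and error handling. As an illustrative sketch only (this is not PySpark's actual helper; the function name is invented), a single connection routine covering both address families might look like:

```python
import socket

def connect_to_local_port(port, timeout=15.0):
    """Connect to a server on localhost, trying every resolved address
    (IPv6 and IPv4) in turn instead of hard-coding one family."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as exc:
            sock.close()
            last_err = exc
    # Re-raise the last connection error if every resolved address failed.
    raise last_err if last_err else OSError("localhost did not resolve")
```

Centralizing the `getaddrinfo` loop like this is what removes the copy-and-paste: a host where `localhost` resolves to `::1` first simply falls through to `127.0.0.1` if the server is IPv4-only.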
[jira] [Commented] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733413#comment-16733413 ] Dongjoon Hyun commented on SPARK-26519: --- +1 for [~hyukjin.kwon]. [~sumanGorantla]. `Operation not allowed` means this is a new feature request for Spark rather than a bug. We usually don't say Hive/Impala has a bug when they don't support Spark commands such as the `CACHE TABLE` syntax. > spark sql CHANGE COLUMN not working > -- > > Key: SPARK-26519 > URL: https://issues.apache.org/jira/browse/SPARK-26519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: !image-2019-01-02-14-25-34-594.png! >Reporter: suman gorantla >Priority: Major > Attachments: sparksql error.PNG > > > Dear Team, > with spark sql I am unable to change the newly added column() position after > an existing column in the table (old_column) of a hive external table please > see the screenshot as in below > scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNs (new_col > STRING)") > res14: org.apache.spark.sql.DataFrame = [] > sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col > STRING AFTER old_col ") > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE > COLUMN ... 
FIRST | AFTER otherCol(line 1, pos 0) > == SQL == > ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col > STRING AFTER old_col > ^^^ > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485) > at > org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623) > ... 
48 elided > !image-2019-01-02-14-25-40-980.png!
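For context on the error above: the `AFTER`/`FIRST` column-reordering clause is Hive DDL that Spark 2.1's SQL parser deliberately rejects, so the statement generally has to be issued through Hive itself rather than through Spark SQL. A sketch of the statement in question (table and column names taken from the report):

```sql
-- Rejected by Spark 2.1's SQL parser; Hive itself accepts this form,
-- since CHANGE COLUMN ... AFTER is a metadata-only Hive operation:
ALTER TABLE enterprisedatalakedev.tmptst
  CHANGE COLUMN new_col new_col STRING AFTER old_col;
```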
[jira] [Commented] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733328#comment-16733328 ] Dongjoon Hyun commented on SPARK-22951: --- This is backported to `branch-2.2` via https://github.com/apache/spark/pull/23434 . > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu >Priority: Major > Labels: correctness > Fix For: 2.2.3, 2.3.0 > > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code}
[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22951: -- Fix Version/s: 2.2.3 > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu >Priority: Major > Labels: correctness > Fix For: 2.2.3, 2.3.0 > > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code}
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733318#comment-16733318 ] Gabor Somogyi commented on SPARK-26385: --- Yeah, and additional logs would also be good, like driver/executor logs. Without them it's hard to tell. {quote}token for REMOVED: HDFS_DELEGATION_TOKEN{quote} looks weird at least. > YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in > cache > --- > > Key: SPARK-26385 > URL: https://issues.apache.org/jira/browse/SPARK-26385 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0, Spark 2.4.0 >Reporter: T M >Priority: Major > > > Hello, > > I have Spark Structured Streaming job which is runnning on YARN(Hadoop 2.6.0, > Spark 2.4.0). After 25-26 hours, my job stops working with following error: > {code:java} > 2018-12-16 22:35:17 ERROR > org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query > TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = > a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, > realUser=, issueDate=1544903057122, maxDate=1545507857122, > sequenceNumber=10314, masterKeyId=344) can't be found in cache at > org.apache.hadoop.ipc.Client.call(Client.java:1470) at > org.apache.hadoop.ipc.Client.call(Client.java:1401) at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at > org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at > org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at > org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming
[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions
[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733297#comment-16733297 ] Tyson Condie commented on SPARK-26257: -- [~rxin] Any thoughts regarding this SPIP? I have a draft of a detailed design document describing what code changes could be made in order to facilitate a general purpose language interop API. Should that be shared here? > SPIP: Interop Support for Spark Language Extensions > --- > > Key: SPARK-26257 > URL: https://issues.apache.org/jira/browse/SPARK-26257 > Project: Spark > Issue Type: Improvement > Components: PySpark, R, Spark Core >Affects Versions: 2.4.0 >Reporter: Tyson Condie >Priority: Minor > > h2. ** Background and Motivation: > There is a desire for third party language extensions for Apache Spark. Some > notable examples include: > * C#/F# from project Mobius [https://github.com/Microsoft/Mobius] > * Haskell from project sparkle [https://github.com/tweag/sparkle] > * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl] > Presently, Apache Spark supports Python and R via a tightly integrated > interop layer. It would seem that much of that existing interop layer could > be refactored into a clean surface for general (third party) language > bindings, such as the above mentioned. More specifically, could we generalize > the following modules: > * Deploy runners (e.g., PythonRunner and RRunner) > * DataFrame Executors > * RDD operations? > The last being questionable: integrating third party language extensions at > the RDD level may be too heavy-weight and unnecessary given the preference > towards the Dataframe abstraction. 
> The main goals of this effort would be: > * Provide a clean abstraction for third party language extensions making it > easier to maintain (the language extension) with the evolution of Apache Spark > * Provide guidance to third party language authors on how a language > extension should be implemented > * Provide general reusable libraries that are not specific to any language > extension > * Open the door to developers that prefer alternative languages > * Identify and clean up common code shared between Python and R interops > h2. Target Personas: > Data Scientists, Data Engineers, Library Developers > h2. Goals: > Data scientists and engineers will have the opportunity to work with Spark in > languages other than what’s natively supported. Library developers will be > able to create language extensions for Spark in a clean way. The interop > layer should also provide guidance for developing language extensions. > h2. Non-Goals: > The proposal does not aim to create an actual language extension. Rather, it > aims to provide a stable interop layer for third party language extensions to > dock. > h2. Proposed API Changes: > Much of the work will involve generalizing existing interop APIs for PySpark > and R, specifically for the Dataframe API. For instance, it would be good to > have a general deploy.Runner (similar to PythonRunner) for language extension > efforts. In Spark SQL, it would be good to have a general InteropUDF and > evaluator (similar to BatchEvalPythonExec). > Low-level RDD operations should not be needed in this initial offering; > depending on the success of the interop layer and with proper demand, RDD > interop could be added later. However, one open question is supporting a > subset of low-level functions that are core to ETL e.g., transform. > h2. 
Optional Design Sketch: > The work would be broken down into two top-level phases: > Phase 1: Introduce general interop API for deploying a driver/application, > running an interop UDF along with any other low-level transformations that > aid with ETL. > Phase 2: Port existing Python and R language extensions to the new interop > layer. This port should be contained solely to the Spark core side, and all > protocols specific to Python and R should not change e.g., Python should > continue to use py4j as the protocol between the Python process and core > Spark. The port itself should be contained to a handful of files e.g., some > examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, > PythonRDD (possibly), and will mostly involve refactoring common logic into > abstract implementations and utilities. > h2. Optional Rejected Designs: > The clear alternative is the status quo; developers that want to provide a > third-party language extension to Spark do so directly; often by extending > existing Python classes and overriding the portions that are relevant to the > new extension. Not only is this not sound code (e.g., a JuliaRDD is not a > PythonRDD, which contains a lot of reusable code), but it runs the great risk > of future revisions making t
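To make the SPIP's API discussion concrete, here is a minimal sketch of what a generalized interop surface might look like. All names below are invented for illustration; none of these types exist in Spark:

```scala
// Hypothetical illustration only; these types are not part of any Spark API.

// Generalized counterpart of the deploy runners (PythonRunner, RRunner):
// launches the guest-language runtime and wires up its channel to the JVM.
trait InteropRunner {
  def launch(userProgram: String, args: Seq[String]): Unit
}

// Generalized counterpart of a Python/R UDF evaluator (cf. BatchEvalPythonExec):
// the engine hands a batch of serialized rows to the guest runtime and reads
// the serialized results back, independent of which guest language is attached.
trait InteropUDF {
  def evaluate(serializedRows: Array[Byte]): Array[Byte]
}
```

The point of such an abstraction is exactly what the SPIP argues: a third-party extension would implement these interfaces rather than subclassing Python-specific classes.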
[jira] [Resolved] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26522. Resolution: Not A Problem Yes, this is an issue with Livy that is fixed in the master branch. Hopefully we can get to a release sometime soon... > Auth secret error in RBackendAuthHandler > > > Key: SPARK-26522 > URL: https://issues.apache.org/jira/browse/SPARK-26522 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Euijun >Priority: Minor > > Hi expert, > I try to use livy to connect sparkR backend. > This is related to > [https://stackoverflow.com/questions/53900995/livy-spark-r-issue] > > Error message is, > {code:java} > Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, : > Auth secret not provided in environment.{code} > > caused by, > spark-2.3.1/R/pkg/R/sparkR.R > {code:java} > sparkR.sparkContext <- function( > ... > authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET") > if (nchar(authSecret) == 0) { > stop("Auth secret not provided in environment.") > } > ... > ) > {code} > > Best regard.
[jira] [Resolved] (SPARK-26517) Avoid duplicate test in ParquetSchemaPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-26517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26517. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23427 [https://github.com/apache/spark/pull/23427] > Avoid duplicate test in ParquetSchemaPruningSuite > - > > Key: SPARK-26517 > URL: https://issues.apache.org/jira/browse/SPARK-26517 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > `testExactCaseQueryPruning` and `testMixedCaseQueryPruning` don't need to set > up `PARQUET_VECTORIZED_READER_ENABLED` config. Because `withMixedCaseData` > will run against both Spark vectorized reader and Parquet-mr reader.
[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26447: --- Assignee: Gengliang Wang > Allow OrcColumnarBatchReader to return less partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733209#comment-16733209 ] Gabor Somogyi commented on SPARK-26359: --- [~Tint] did the suggested workaround work? I would close this, since it happened because of S3's lack of read-after-write consistency. [~ste...@apache.org] {quote}With S3 the time to rename is about 6-10MB/s{quote} all I can say wow :) > Spark checkpoint restore fails after query restart > -- > > Key: SPARK-26359 > URL: https://issues.apache.org/jira/browse/SPARK-26359 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 deployed in standalone-client mode > Checkpointing is done to S3 > The Spark application in question is responsible for running 4 different > queries > Queries are written using Structured Streaming > We are using the following algorithm for hopes of better performance: > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the > default is 1 >Reporter: Kaspar Tint >Priority: Major > Attachments: driver-redacted, metadata, redacted-offsets, > state-redacted, worker-redacted > > > We had an incident where one of our structured streaming queries could not be > restarted after a usual S3 checkpointing failure. Now to clarify before > everything else - we are aware of the issues with S3 and are working towards > moving to HDFS but this will take time. S3 will cause queries to fail quite > often during peak hours and we have separate logic to handle this that will > attempt to restart the failed queries if possible. > In this particular case we could not restart one of the failed queries. Seems > like between detecting a failure in the query and starting it up again > something went really wrong with Spark and state in checkpoint folder got > corrupted for some reason. 
> The issue starts with the usual *FileNotFoundException* that happens with S3 > {code:java} > 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = > c074233a-2563-40fc-8036-b5e38e2e2c42, runId = > e607eb6e-8431-4269-acab-cc2c1f9f09dd] > terminated with error > java.io.FileNotFoundException: No such file or directory: > s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37 > 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) > at > org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131) > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726) > at > org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965) > at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL > og.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBat
[jira] [Resolved] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26447. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23387 [https://github.com/apache/spark/pull/23387] > Allow OrcColumnarBatchReader to return less partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Commented] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733201#comment-16733201 ] Gabor Somogyi commented on SPARK-26389: --- A temp checkpoint can be used in a single-node scenario, and it is deleted only if the query didn't fail. > temp checkpoint folder at executor should be deleted on graceful shutdown > - > > Key: SPARK-26389 > URL: https://issues.apache.org/jira/browse/SPARK-26389 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Fengyu Cao >Priority: Major > > {{spark-submit --master mesos:// -conf > spark.streaming.stopGracefullyOnShutdown=true framework>}} > CTRL-C, framework shutdown > {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = > f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = > 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error > org.apache.spark.SparkException: Writing job aborted.}} > {{/tmp/temporary- on executor not deleted due to > org.apache.spark.SparkException: Writing job aborted., and this temp > checkpoint can't used to recovery.}}
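Related note: a query only falls back to a temporary `/tmp/temporary-*` checkpoint when no explicit location is given. A hedged sketch of pinning a durable checkpoint location instead (the streaming DataFrame `df` and the HDFS path here are hypothetical):

```python
# Sketch only: `df` is an existing streaming DataFrame; the path is made up.
query = (df.writeStream
           .format("console")
           .option("checkpointLocation", "hdfs:///checkpoints/my-query")
           .start())
```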
[jira] [Assigned] (SPARK-26517) Avoid duplicate test in ParquetSchemaPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-26517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26517: - Assignee: Liang-Chi Hsieh > Avoid duplicate test in ParquetSchemaPruningSuite > - > > Key: SPARK-26517 > URL: https://issues.apache.org/jira/browse/SPARK-26517 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > > `testExactCaseQueryPruning` and `testMixedCaseQueryPruning` don't need to set > up `PARQUET_VECTORIZED_READER_ENABLED` config. Because `withMixedCaseData` > will run against both Spark vectorized reader and Parquet-mr reader.
[jira] [Resolved] (SPARK-26501) Unexpected overriden of exitFn in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-26501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26501. --- Resolution: Fixed Assignee: liupengcheng Fix Version/s: 3.0.0 2.4.1 Resolved by https://github.com/apache/spark/pull/23404 > Unexpected overriden of exitFn in SparkSubmitSuite > -- > > Key: SPARK-26501 > URL: https://issues.apache.org/jira/browse/SPARK-26501 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: liupengcheng >Assignee: liupengcheng >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > When I run SparkSubmitSuite of spark2.3.2 in the IntelliJ IDE, I found that some > tests cannot pass when I run them one by one, but they passed when the whole > SparkSubmitSuite was run. > Failed tests when run separately: > > {code:java} > test("SPARK_CONF_DIR overrides spark-defaults.conf") { > forConfDir(Map("spark.executor.memory" -> "2.3g")) { path => > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val args = Seq( > "--class", SimpleApplicationTest.getClass.getName.stripSuffix("$"), > "--name", "testApp", > "--master", "local", > unusedJar.toString) > val appArgs = new SparkSubmitArguments(args, Map("SPARK_CONF_DIR" -> > path)) > assert(appArgs.defaultPropertiesFile != null) > assert(appArgs.defaultPropertiesFile.startsWith(path)) > assert(appArgs.propertiesFile == null) > appArgs.executorMemory should be ("2.3g") > } > } > {code} > Failure reason: > {code:java} > Error: Executor Memory cores must be a positive number > Run with --help for usage help or --verbose for debug output > {code} > > After carefully checking the code, I found the exitFn of SparkSubmit is > overridden by earlier tests via the testPrematureExit call. > Although the above test was fixed by SPARK-22941, the overriding of exitFn > might cause other problems in the future. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
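The failure mode described in SPARK-26501 — one test overriding a suite-wide exit hook and leaking that override into later tests — can be sketched outside Spark. The names below are illustrative, not Spark's actual SparkSubmit API:

```scala
// A suite-wide mutable exit hook; by default it "really exits" (here modeled
// as throwing, so the sketch is runnable without killing the JVM).
var exitFn: Int => Unit = code => throw new RuntimeException(s"process exit $code")

def testThatTriggersPrematureExit(): Unit = {
  // The test swaps in a harmless hook so the suite survives the exit path...
  exitFn = _ => ()
  // ...but never restores the original, so the override leaks into any
  // test that runs afterwards and relies on real exit behavior.
}

testThatTriggersPrematureExit()
```

After this test runs, a later test calling `exitFn` silently does nothing instead of exiting, which is exactly the cross-test interference the report warns about.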
[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733173#comment-16733173 ] Neeraj Bhadani commented on SPARK-26075: [~hyukjin.kwon] I have verified with Spark 2.4 and getting the same error. > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use the broadcast join but getting below error in Spark 2.3. > However, the same code is working fine in Spark 2.2 > > Upon checking the size of the dataframes its merely 50 MB and I have set the > threshold to 200 MB as well. As I mentioned above same code is working fine > in Spark 2.2 > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, Disabling the broadcasting is working fine. > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
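The behavior the SPARK-26075 report describes can be modeled with a simplified, hypothetical version of the broadcast decision (the real check lives in Spark's planning code and depends on size *estimation*, which is where a 50 MB table can be mis-estimated past the cap). Names and the exact check are assumptions for illustration:

```scala
// Hypothetical simplification: broadcast only when the estimated size fits
// both the configured spark.sql.autoBroadcastJoinThreshold and the hard 8 GB
// cap behind the reported error; a threshold of -1 disables auto-broadcast,
// which is the workaround mentioned above.
val hardCapBytes = 8L * 1024 * 1024 * 1024

def wouldBroadcast(estimatedBytes: Long, thresholdBytes: Long): Boolean =
  thresholdBytes >= 0 && estimatedBytes <= thresholdBytes && estimatedBytes <= hardCapBytes
```

Under this model the reporter's 50 MB table with a 200 MB threshold should broadcast; the error suggests the estimate fed into the check, not the actual data size, exceeded 8 GB.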
[jira] [Resolved] (SPARK-26514) Introduce a new conf to improve the task parallelism
[ https://issues.apache.org/jira/browse/SPARK-26514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26514. --- Resolution: Won't Fix > Introduce a new conf to improve the task parallelism > > > Key: SPARK-26514 > URL: https://issues.apache.org/jira/browse/SPARK-26514 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 >Reporter: zhoukang >Priority: Major > > Currently, we have a conf below. > {code:java} > private[spark] val CPUS_PER_TASK = > ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1) > {code} > For some applications which are not cpu-intensive,we may want to let one cpu > to run more than one task. > Like: > {code:java} > private[spark] val TASKS_PER_CPU = > ConfigBuilder("spark.cpu.tasks").intConf.createWithDefault(1) > {code} > Which can improve performance for some applications and can also improve > resource utilization -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
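The scheduling effect of the proposed `spark.cpu.tasks` can be sketched with slot arithmetic (the config names come from the issue; the helper functions are illustrative, not Spark code):

```scala
// Today: spark.task.cpus reserves N cores per task, so an executor with
// `cores` cores offers cores / cpusPerTask concurrent task slots.
def slotsToday(cores: Int, cpusPerTask: Int): Int = cores / cpusPerTask

// Proposal: spark.cpu.tasks would let one core run several tasks, so the
// same executor offers cores * tasksPerCpu slots for non-CPU-bound jobs.
def slotsProposed(cores: Int, tasksPerCpu: Int): Int = cores * tasksPerCpu
```

For an 8-core executor, raising tasks-per-CPU from 1 to 2 doubles the slot count, which is the utilization gain the proposal is after.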
[jira] [Updated] (SPARK-26434) disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but logWarning in org.apache.spark.sql.streaming.S
[ https://issues.apache.org/jira/browse/SPARK-26434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-26434: -- Affects Version/s: (was: 2.4.0) 3.0.0 > disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but > logWarning in org.apache.spark.sql.streaming.StreamingQueryManager#createQuery > - > > Key: SPARK-26434 > URL: https://issues.apache.org/jira/browse/SPARK-26434 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: iamhumanbeing >Priority: Trivial > > disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but > logWarning in org.apache.spark.sql.streaming.StreamingQueryManager#createQuery > it is better to logWarning in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26526) test case for SPARK-10316 is not valid any more
Liu, Linhong created SPARK-26526: Summary: test case for SPARK-10316 is not valid any more Key: SPARK-26526 URL: https://issues.apache.org/jira/browse/SPARK-26526 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Liu, Linhong Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used to make sure non-deterministic `Filter` won't be pushed through `Project` But in current code base this test case can't cover this purpose. Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
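The invariant that SPARK-10316's test is meant to protect can be shown outside Spark: a non-deterministic predicate's answer depends on evaluation history, not on the row alone, so the optimizer must not move it across a `Project`. A stateful stand-in for `rand()` makes this concrete:

```scala
// A stand-in for a non-deterministic predicate such as rand() < 0.5:
// it keeps only the first two rows it is ever asked about, so moving
// where (or how often) it is evaluated changes the query result.
var calls = 0
def nonDeterministic(row: Int): Boolean = { calls += 1; calls <= 2 }

val rows = Seq(1, 2, 3, 4)
val planA = rows.filter(nonDeterministic) // evaluated first: keeps 1 and 2
val planB = rows.filter(nonDeterministic) // identical rows, evaluated later: keeps nothing
```

Two "plans" over the same rows disagree, which is why pushdown rules must treat such filters as barriers — and why the test needs a plan shape (e.g. HadoopFsRelation) where the pushdown would actually be attempted.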
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26526: Assignee: Apache Spark > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Apache Spark >Priority: Major > > Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used > to make sure non-deterministic `Filter` won't be pushed through `Project` > But in current code base this test case can't cover this purpose. > Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26526: Assignee: (was: Apache Spark) > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Priority: Major > > Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used > to make sure non-deterministic `Filter` won't be pushed through `Project` > But in current code base this test case can't cover this purpose. > Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26454) While creating new UDF with JAR though UDF is created successfully, it throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26454: Assignee: (was: Apache Spark) > While creating new UDF with JAR though UDF is created successfully, it throws > IllegegalArgument Exception > - > > Key: SPARK-26454 > URL: https://issues.apache.org/jira/browse/SPARK-26454 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Udbhav Agrawal >Priority: Trivial > Attachments: create_exception.txt > > > 【Test step】: > 1.launch spark-shell > 2. set role admin; > 3. create new function > CREATE FUNCTION Func AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar' > 4. Do select on the function > sql("select Func('2018-03-09')").show() > 5.Create new UDF with same JAR > sql("CREATE FUNCTION newFunc AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar'") > 6. Do select on the new function created. > sql("select newFunc ('2018-03-09')").show() > 【Output】: > Function is getting created but illegal argument exception is thrown , select > provides result but with illegal argument exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26454) While creating new UDF with JAR though UDF is created successfully, it throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26454: Assignee: Apache Spark > While creating new UDF with JAR though UDF is created successfully, it throws > IllegegalArgument Exception > - > > Key: SPARK-26454 > URL: https://issues.apache.org/jira/browse/SPARK-26454 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Udbhav Agrawal >Assignee: Apache Spark >Priority: Trivial > Attachments: create_exception.txt > > > 【Test step】: > 1.launch spark-shell > 2. set role admin; > 3. create new function > CREATE FUNCTION Func AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar' > 4. Do select on the function > sql("select Func('2018-03-09')").show() > 5.Create new UDF with same JAR > sql("CREATE FUNCTION newFunc AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar'") > 6. Do select on the new function created. > sql("select newFunc ('2018-03-09')").show() > 【Output】: > Function is getting created but illegal argument exception is thrown , select > provides result but with illegal argument exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress) and each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, and each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress), in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak. > An example is Shuffle -> map -> Coalesce(shuffle = false). 
Each > ShuffleBlockFetcherIterator contains some metas about > MapStatus(blocksByAddress) and each ShuffleMapTask will keep n(max to shuffle > partitions) shuffleBlockFetcherIterator for they are refered by > onCompleteCallbacks of TaskContext, in some case, it may take huge memory and > the memory will not released until the task finished. > Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
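The fix direction described in SPARK-26525 — dropping the iterator's references as soon as it is drained, rather than holding them until task completion — can be sketched with a wrapper similar in spirit to Spark's `CompletionIterator`. The class below is an illustrative sketch, not the actual patch:

```scala
// Runs `cleanup` exactly once, the first time the wrapped iterator reports
// exhaustion, instead of pinning buffers until the task-completion callbacks.
class EagerCleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit) extends Iterator[A] {
  private var cleaned = false
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) { cleaned = true; cleanup() }
    more
  }
  override def next(): A = underlying.next()
}
```

With such a wrapper, each of the n iterators kept alive by the coalesced task could release its MapStatus metadata the moment its partition is fully consumed.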
[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732895#comment-16732895 ] Hyukjin Kwon commented on SPARK-26433: -- There are few potential workarounds. For instnace, http://www.swi.com/spark-rdd-getting-bottom-records/ or {{df.sort($"ColumnName".desc).show()}}. BTW, usually tail or head are used in Scala as below (IMHO): {code} scala> Seq(1, 2, 3).tail res10: Seq[Int] = List(2, 3) scala> Seq(1, 2, 3).head res11: Int = 1 {code} > Tail method for spark DataFrame > --- > > Key: SPARK-26433 > URL: https://issues.apache.org/jira/browse/SPARK-26433 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jan Gorecki >Priority: Major > > There is a head method for spark dataframes which work fine but there doesn't > seems to be tail method. > ``` > >>> ans > >>> > DataFrame[v1: bigint] > > >>> ans.head(3) > >>> > [Row(v1=299443), Row(v1=299493), Row(v1=300751)] > >>> ans.tail(3) > Traceback (most recent call last): > File "", line 1, in > File > "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py > spark/sql/dataframe.py", line 1300, in __getattr__ > "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) > AttributeError: 'DataFrame' object has no attribute 'tail' > ``` > I would like to feature request Tail method for spark dataframe -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
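Until a DataFrame `tail` exists, the requested semantics (the last n rows) can be emulated by numbering rows and keeping the final n. A sketch on plain Scala collections, mirroring the zipWithIndex-style workaround (the helper name is hypothetical):

```scala
// Keep only the rows whose position falls within the final n; asking for
// more rows than exist simply returns everything, matching head's leniency.
def tailN[A](rows: Seq[A], n: Int): Seq[A] =
  rows.zipWithIndex.collect { case (row, i) if i >= rows.length - n => row }
```

On a DataFrame the same idea would need a total ordering plus a row index (e.g. via window functions), since distributed rows have no inherent order.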
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, and each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress), in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak, because it contains some metas about MapStatus(blocksByAddress), which may take huge memory. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak. > An example is Shuffle -> map -> Coalesce(shuffle = false). 
Each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext, and each ShuffleBlockFetcherIterator contains some metas about > MapStatus(blocksByAddress), in some case, it may take huge memory and the > memory will not released until the task finished. > Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732882#comment-16732882 ] Hyukjin Kwon commented on SPARK-25253: -- [~irashid], did you manually backport this into branch-2.2? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26525: Assignee: (was: Apache Spark) > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732874#comment-16732874 ] Hyukjin Kwon commented on SPARK-26502: -- Can we update JIRA as well, [~maxgekk]? > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732873#comment-16732873 ] Hyukjin Kwon commented on SPARK-26502: -- Fixed in https://github.com/apache/spark/pull/23409 > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26525: Assignee: Apache Spark > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Assignee: Apache Spark >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-26502. --- Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 3.0.0 > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w
[ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26524: Assignee: Apache Spark > If the application directory fails to be created on the SPARK_WORKER_DIR on > some woker nodes (for example, bad disk or disk has no capacity), the > application executor will be allocated indefinitely. > -- > > Key: SPARK-26524 > URL: https://issues.apache.org/jira/browse/SPARK-26524 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: Apache Spark >Priority: Major > > When the spark worker is started, the workerdir is created successfully. When > the application is submitted, the disks mounted by the workerdir and > worker122 workerdir are damaged. > When a worker allocates an executor, it creates a working directory and a > temporary directory. If the creation fails, the executor allocation fails. > The application directory fails to be created on the SPARK_WORKER_DIR on > woker121 and worker122,the application executor will be allocated > indefinitely. 
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5762 because it is FAILED > 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5765 on worker > worker-20190103154858-worker121-37199 > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5764 because it is FAILED > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5766 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5766 because it is FAILED > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5767 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5765 because it is FAILED > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5768 on worker > worker-20190103154858-worker121-37199 > ... > I observed the code and found that spark has some processing for the failure > of the executor allocation. However, it can only handle the case where the > current application does not have an executor that has been successfully > assigned. > if (!normalExit > && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES > && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path > val execs = appInfo.executors.values > if (!execs.exists(_.state == ExecutorState.RUNNING)) { > logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " + > s"${appInfo.retryCount} times; removing it") > removeApplication(appInfo, ApplicationState.FAILED) > } > } > Master will only judge whether the worker is available according to the > resources of the worker. 
> // Filter out workers that don't have enough resources to launch an executor > val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) > .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && > worker.coresFree >= coresPerExecutor) > .sortBy(_.coresFree).reverse > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
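The guard quoted in the report can be restated as a tiny predicate (illustrative names) to show why the loop never terminates: as long as any previously launched executor is still RUNNING, the application is never failed, so workers whose directories cannot be created keep receiving and failing executors.

```scala
// Mirrors the quoted condition: kill the app only when retries are exhausted
// AND no executor is currently RUNNING. One healthy executor anywhere keeps
// the app alive, so the broken workers are retried indefinitely.
def shouldKillApp(retryCount: Int, maxRetries: Int, anyRunning: Boolean): Boolean =
  maxRetries >= 0 && retryCount >= maxRetries && !anyRunning
```

A fix would have to consider per-worker failure history in the usable-workers filter, not just free memory and cores.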
[jira] [Assigned] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w
[ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26524: Assignee: (was: Apache Spark) > If the application directory fails to be created on the SPARK_WORKER_DIR on > some woker nodes (for example, bad disk or disk has no capacity), the > application executor will be allocated indefinitely. > -- > > Key: SPARK-26524 > URL: https://issues.apache.org/jira/browse/SPARK-26524 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Major > > When the spark worker is started, the workerdir is created successfully. When > the application is submitted, the disks mounted by the workerdir and > worker122 workerdir are damaged. > When a worker allocates an executor, it creates a working directory and a > temporary directory. If the creation fails, the executor allocation fails. > The application directory fails to be created on the SPARK_WORKER_DIR on > woker121 and worker122,the application executor will be allocated > indefinitely. 
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5762 because it is FAILED > 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5765 on worker > worker-20190103154858-worker121-37199 > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5764 because it is FAILED > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5766 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5766 because it is FAILED > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5767 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5765 because it is FAILED > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5768 on worker > worker-20190103154858-worker121-37199 > ... > I observed the code and found that spark has some processing for the failure > of the executor allocation. However, it can only handle the case where the > current application does not have an executor that has been successfully > assigned. > if (!normalExit > && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES > && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path > val execs = appInfo.executors.values > if (!execs.exists(_.state == ExecutorState.RUNNING)) { > logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " + > s"${appInfo.retryCount} times; removing it") > removeApplication(appInfo, ApplicationState.FAILED) > } > } > Master will only judge whether the worker is available according to the > resources of the worker. 
> // Filter out workers that don't have enough resources to launch an executor > val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) > .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && > worker.coresFree >= coresPerExecutor) > .sortBy(_.coresFree).reverse > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak, because it contains some metas about MapStatus(blocksByAddress), which may take huge memory. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
liupengcheng created SPARK-26525:
Summary: Fast release memory of ShuffleBlockFetcherIterator
Key: SPARK-26525
URL: https://issues.apache.org/jira/browse/SPARK-26525
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 2.3.2
Reporter: liupengcheng

Currently, Spark does not release the ShuffleBlockFetcherIterator until the whole task has finished. In some conditions this causes a memory leak.
An example is Shuffle -> map -> Coalesce(shuffle = false): each ShuffleMapTask keeps up to n (the number of shuffle partitions) ShuffleBlockFetcherIterators, because they are referred to by the onCompleteCallbacks of the TaskContext.
We can release each ShuffleBlockFetcherIterator as soon as it is consumed.
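The eager-release idea above can be sketched as follows. This is an illustrative, simplified sketch only: the names Fetcher, consumedWith, and cleanup are hypothetical stand-ins for ShuffleBlockFetcherIterator and its buffered blocksByAddress metadata, and the pattern mirrors Spark's CompletionIterator, which runs a cleanup callback once an iterator is drained rather than waiting for task completion.

```scala
// Illustrative sketch only: Fetcher and consumedWith are hypothetical names,
// not Spark's actual API; they stand in for ShuffleBlockFetcherIterator.
object EagerRelease {
  // Simplified stand-in for the fetcher and its blocksByAddress metadata.
  final class Fetcher(blocks: Seq[Int]) {
    var released = false
    def iterator: Iterator[Int] = blocks.iterator
    // In Spark this would drop the buffered MapStatus metadata.
    def cleanup(): Unit = { released = true }
  }

  // Wrap the fetcher's iterator so cleanup() runs as soon as the last element
  // is consumed, instead of waiting for the task-completion callback.
  def consumedWith(fetcher: Fetcher): Iterator[Int] = new Iterator[Int] {
    private val it = fetcher.iterator
    def hasNext: Boolean = {
      val more = it.hasNext
      if (!more && !fetcher.released) fetcher.cleanup()
      more
    }
    def next(): Int = it.next()
  }

  def main(args: Array[String]): Unit = {
    val fetcher = new Fetcher(Seq(1, 2, 3))
    val total = consumedWith(fetcher).sum
    println(s"sum=$total released=${fetcher.released}") // prints "sum=6 released=true"
  }
}
```

The key design point is that the wrapper, not the task, owns the release: once hasNext first returns false the metadata is dropped, even if the TaskContext keeps a reference to the wrapper until the task ends.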
[jira] [Created] (SPARK-26524) If the application directory fails to be created in SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application executor will be allocated indefinitely
hantiantian created SPARK-26524:
---
Summary: If the application directory fails to be created in SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application executor will be allocated indefinitely.
Key: SPARK-26524
URL: https://issues.apache.org/jira/browse/SPARK-26524
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.0
Reporter: hantiantian

When the Spark worker starts, the worker dir is created successfully. After the application is submitted, the disks backing the worker dirs on worker121 and worker122 are damaged. When a worker allocates an executor, it creates a working directory and a temporary directory; if this creation fails, the executor allocation fails. Because the application directory cannot be created in SPARK_WORKER_DIR on worker121 and worker122, application executors are allocated indefinitely:

2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5762 because it is FAILED
2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5765 on worker worker-20190103154858-worker121-37199
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5764 because it is FAILED
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5766 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5766 because it is FAILED
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5767 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5765 because it is FAILED
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5768 on worker worker-20190103154858-worker121-37199
...

I looked at the code and found that Spark does have some handling for executor-allocation failures. However, it only covers the case where the application has no executor that was ever successfully started:

{code:java}
if (!normalExit
    && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
    && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
  val execs = appInfo.executors.values
  if (!execs.exists(_.state == ExecutorState.RUNNING)) {
    logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
      s"${appInfo.retryCount} times; removing it")
    removeApplication(appInfo, ApplicationState.FAILED)
  }
}
{code}

The Master judges whether a worker is usable only by the worker's free resources:

{code:java}
// Filter out workers that don't have enough resources to launch an executor
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor)
  .sortBy(_.coresFree).reverse
{code}
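One possible direction, sketched below under stated assumptions: track consecutive executor-launch failures per worker and exclude a worker once it crosses a threshold, extending the resource-only usableWorkers filter quoted above. All names here (WorkerFailureTracking, maxFailuresPerWorker, isUsable) are illustrative and are not Spark's actual API.

```scala
// Hypothetical sketch, not Spark's real scheduler code: exclude a worker from
// scheduling after repeated executor-launch failures (e.g. a bad SPARK_WORKER_DIR),
// so the Master stops re-allocating executors to it indefinitely.
object WorkerFailureTracking {
  val maxFailuresPerWorker = 3
  private val failures =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  def recordFailure(workerId: String): Unit = failures(workerId) += 1
  def recordSuccess(workerId: String): Unit = failures(workerId) = 0

  // Mirrors the Master's resource checks, extended with the failure count.
  def isUsable(workerId: String, alive: Boolean, memFree: Int, memNeeded: Int,
               coresFree: Int, coresNeeded: Int): Boolean = {
    alive && memFree >= memNeeded && coresFree >= coresNeeded &&
      failures(workerId) < maxFailuresPerWorker
  }

  def main(args: Array[String]): Unit = {
    (1 to 3).foreach(_ => recordFailure("worker121"))
    println(isUsable("worker121", alive = true, 4096, 1024, 8, 2)) // prints "false"
    println(isUsable("worker122", alive = true, 4096, 1024, 8, 2)) // prints "true"
  }
}
```

A real fix would also need a recovery path (e.g. resetting the count on success or after a timeout) so a transiently bad worker is not excluded forever.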
[jira] [Commented] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732762#comment-16732762 ] Hyukjin Kwon commented on SPARK-26522:
--
I think this is something Livy should support, rather than having Spark support insecure connections. See also SPARK-26019. I don't think it's going to be fixed in Spark 3.0, at the very least. WDYT [~vanzin] and [~irashid]?

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
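The check quoted above can be summarized with a minimal sketch (not Spark's actual code): the R side aborts when SPARKR_BACKEND_AUTH_SECRET is missing or empty, so any launcher (such as Livy) must inject the secret into the child R process's environment. AuthSecretCheck and checkSecret below are hypothetical names for illustration.

```scala
// Illustrative sketch of the SPARKR_BACKEND_AUTH_SECRET check; AuthSecretCheck
// is a hypothetical name, not part of Spark.
object AuthSecretCheck {
  // Returns Right(secret) when the variable is set and non-empty,
  // otherwise Left with the same error message SparkR raises.
  def checkSecret(env: Map[String, String]): Either[String, String] =
    env.get("SPARKR_BACKEND_AUTH_SECRET").filter(_.nonEmpty)
      .toRight("Auth secret not provided in environment.")

  def main(args: Array[String]): Unit = {
    println(checkSecret(Map.empty))
    println(checkSecret(Map("SPARKR_BACKEND_AUTH_SECRET" -> "secret")))
  }
}
```

This is why the comment above points at Livy: the fix belongs in the process that launches the R backend, by populating the environment, rather than in Spark relaxing the check.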
[jira] [Assigned] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26522:
Assignee: (was: Matt Cheah)

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
[jira] [Updated] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26522:
-
Labels: (was: newbie)

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Assignee: Matt Cheah
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
[jira] [Resolved] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26519.
--
Resolution: Duplicate

> spark sql CHANGE COLUMN not working
> --
>
> Key: SPARK-26519
> URL: https://issues.apache.org/jira/browse/SPARK-26519
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: !image-2019-01-02-14-25-34-594.png!
> Reporter: suman gorantla
> Priority: Major
> Attachments: sparksql error.PNG
>
> Dear Team,
> With Spark SQL I am unable to move a newly added column (new_col) to a position after an existing column (old_col) in a Hive external table; please see the screenshot below.
> {code:java}
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNS (new_col STRING)")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE COLUMN ... FIRST | AFTER otherCol(line 1, pos 0)
>
> == SQL ==
> ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col
> ^^^
>
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55)
>   at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485)
>   at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
>   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
>   ... 48 elided
> {code}
> !image-2019-01-02-14-25-40-980.png!
[jira] [Commented] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732758#comment-16732758 ] Hyukjin Kwon commented on SPARK-26519:
--
This is a duplicate of SPARK-23890 or SPARK-24602.

> spark sql CHANGE COLUMN not working
> --
>
> Key: SPARK-26519
> URL: https://issues.apache.org/jira/browse/SPARK-26519
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: !image-2019-01-02-14-25-34-594.png!
> Reporter: suman gorantla
> Priority: Major
> Attachments: sparksql error.PNG
>
> Dear Team,
> With Spark SQL I am unable to move a newly added column (new_col) to a position after an existing column (old_col) in a Hive external table; please see the screenshot below.
> {code:java}
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNS (new_col STRING)")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE COLUMN ... FIRST | AFTER otherCol(line 1, pos 0)
>
> == SQL ==
> ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col
> ^^^
>
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55)
>   at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485)
>   at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
>   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
>   ... 48 elided
> {code}
> !image-2019-01-02-14-25-40-980.png!