[jira] [Commented] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2019-08-07 Thread jiaan.geng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902711#comment-16902711
 ] 

jiaan.geng commented on SPARK-28330:


I'm working on this.

> ANSI SQL: Top-level <result offset clause> in <query expression>
> 
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html
> *Feature ID*: F861
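As a rough illustration of the semantics quoted above (an analogy only, not the Spark implementation this ticket proposes), skipping OFFSET rows before counting LIMIT rows over an ordered result is a drop-then-take:

{code:scala}
// Minimal sketch of the LIMIT/OFFSET semantics described above, using a plain
// Scala collection as a stand-in for an ordered result set. This is only an
// analogy, not the implementation proposed by this ticket.
object LimitOffsetSketch {
  def main(args: Array[String]): Unit = {
    val rows = (1 to 10).toList   // the ordered query result
    val offset = 3                // OFFSET 3: skip the first three rows
    val limit = 4                 // LIMIT 4: then return at most four rows
    val page = rows.drop(offset).take(limit)
    println(page)                 // List(4, 5, 6, 7)
  }
}
{code}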






[jira] [Commented] (SPARK-28613) Spark SQL action collect only judges the size of the compressed RDD, which is not accurate enough

2019-08-07 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902704#comment-16902704
 ] 

Shivu Sondur commented on SPARK-28613:
--

I will check

> Spark SQL action collect only judges the size of the compressed RDD, which is 
> not accurate enough
> --
>
> Key: SPARK-28613
> URL: https://issues.apache.org/jira/browse/SPARK-28613
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When we run the action DataFrame.collect(), the check against the 
> configuration *spark.driver.maxResultSize* is done on the size of the 
> compressed byte array when determining whether the returned data exceeds the 
> limit, which is not accurate. When we fetch data through the Spark Thrift 
> Server without incremental collection, it retrieves all of the DataFrame's 
> data for each partition.
> The returned data goes through this process:
>  # compress the data's byte array
>  # package it as a ResultSet
>  # return it to the driver and check it against *spark.driver.maxResultSize*
>  # *decode (uncompress) the data as Array[Row]*
> The size of the data before decompression differs significantly from the size 
> after decompression; the difference can be more than ten times.
>  
>  
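A minimal sketch of why the two sizes can diverge, assuming nothing about Spark's actual task-result encoding; the payload below is arbitrary sample data:

{code:scala}
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Hedged sketch: compares a payload's size before and after GZIP compression to
// show why a limit checked against compressed bytes can differ greatly from the
// decoded in-memory result. The payload is arbitrary, repetitive sample data.
object CompressedSizeSketch {
  def main(args: Array[String]): Unit = {
    val raw: Array[Byte] =
      Seq.fill(100000)("some repetitive row value").mkString("\n").getBytes("UTF-8")

    val bos = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(bos)
    gzip.write(raw)
    gzip.close()
    val compressed = bos.toByteArray

    println(s"uncompressed bytes: ${raw.length}")
    println(s"compressed bytes:   ${compressed.length}")
  }
}
{code}

Highly repetitive data like this compresses by an order of magnitude or more, which is consistent with the "more than ten times" gap described in the issue.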






[jira] [Resolved] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-28634.
-
Resolution: Not A Problem

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2466)
>   at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:948)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:939)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:315)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.i

[jira] [Commented] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902702#comment-16902702
 ] 

Yuming Wang commented on SPARK-28634:
-

Thank you, [~vanzin]. It works.

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2466)
>   at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:948)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:939)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:315)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Meth

[jira] [Updated] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28330:

Description: 
h2. {{LIMIT}} and {{OFFSET}}

LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
generated by the rest of the query:
{noformat}
SELECT select_list
FROM table_expression
[ ORDER BY ... ]
[ LIMIT { number | ALL } ] [ OFFSET number ]
{noformat}
If a limit count is given, no more than that many rows will be returned (but 
possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
as omitting the LIMIT clause, as is LIMIT with a NULL argument.

OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 is 
the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.

If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
to count the LIMIT rows that are returned.

https://www.postgresql.org/docs/11/queries-limit.html

*Feature ID*: F861

  was:
h2. {{LIMIT}} and {{OFFSET}}

LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
generated by the rest of the query:
{noformat}
SELECT select_list
FROM table_expression
[ ORDER BY ... ]
[ LIMIT { number | ALL } ] [ OFFSET number ]
{noformat}
If a limit count is given, no more than that many rows will be returned (but 
possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
as omitting the LIMIT clause, as is LIMIT with a NULL argument.

OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 is 
the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.

If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
to count the LIMIT rows that are returned.

https://www.postgresql.org/docs/11/queries-limit.html


> ANSI SQL: Top-level <result offset clause> in <query expression>
> 
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html
> *Feature ID*: F861






[jira] [Updated] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28330:

Summary: ANSI SQL: Top-level <result offset clause> in <query expression>  
(was: Enhance query limit)

> ANSI SQL: Top-level <result offset clause> in <query expression>
> 
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html






[jira] [Resolved] (SPARK-28454) Validate LongType in _make_type_verifier

2019-08-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28454.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25117
[https://github.com/apache/spark/pull/25117]

> Validate LongType in _make_type_verifier
> 
>
> Key: SPARK-28454
> URL: https://issues.apache.org/jira/browse/SPARK-28454
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3
>Reporter: AY
>Assignee: AY
>Priority: Major
> Fix For: 3.0.0
>
>
> {{pyspark.sql.types._make_type_verifier doesn't validate the LongType value 
> range.}}
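The missing check is essentially a signed 64-bit range test; a language-agnostic sketch of it (written here in Scala, not the PySpark _make_type_verifier code itself) looks like:

{code:scala}
// Hedged sketch of a signed 64-bit range check, analogous to what the issue asks
// _make_type_verifier to do for LongType values. Not the PySpark implementation.
object LongRangeCheckSketch {
  def fitsInLong(v: BigInt): Boolean =
    v >= BigInt(Long.MinValue) && v <= BigInt(Long.MaxValue)

  def main(args: Array[String]): Unit = {
    println(fitsInLong(BigInt("9223372036854775807")))  // true:  Long.MaxValue
    println(fitsInLong(BigInt("9223372036854775808")))  // false: Long.MaxValue + 1
  }
}
{code}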






[jira] [Assigned] (SPARK-28454) Validate LongType in _make_type_verifier

2019-08-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28454:


Assignee: AY

> Validate LongType in _make_type_verifier
> 
>
> Key: SPARK-28454
> URL: https://issues.apache.org/jira/browse/SPARK-28454
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3
>Reporter: AY
>Assignee: AY
>Priority: Major
>
> {{pyspark.sql.types._make_type_verifier doesn't validate the LongType value 
> range.}}






[jira] [Issue Comment Deleted] (SPARK-28474) Lower JDBC client cannot read binary type

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28474:

Comment: was deleted

(was: I'm working on.)

> Lower JDBC client cannot read binary type
> -
>
> Key: SPARK-28474
> URL: https://issues.apache.org/jira/browse/SPARK-28474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Logs:
> {noformat}
> java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible 
> with java.lang.String
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:770)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy26.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:819)
> Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198)
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:148)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   ... 18 more
> {noformat}






[jira] [Assigned] (SPARK-28649) Git Ignore does not ignore python/.eggs

2019-08-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28649:
-

Assignee: Rob Vesse

> Git Ignore does not ignore python/.eggs
> ---
>
> Key: SPARK-28649
> URL: https://issues.apache.org/jira/browse/SPARK-28649
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
>
> Currently the {{python/.eggs}} folder is not in the {{.gitignore}} file.  If 
> you are building a Spark distribution from your working copy and enabling 
> Python distribution as part of that you'll end up with this folder present 
> and Git will always warn you that it has untracked changes as a result.  
> Since this directory contains transient build artifacts this should be 
> ignored.






[jira] [Resolved] (SPARK-28649) Git Ignore does not ignore python/.eggs

2019-08-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28649.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 25380
[https://github.com/apache/spark/pull/25380]

> Git Ignore does not ignore python/.eggs
> ---
>
> Key: SPARK-28649
> URL: https://issues.apache.org/jira/browse/SPARK-28649
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0, 2.4.4
>
>
> Currently the {{python/.eggs}} folder is not in the {{.gitignore}} file.  If 
> you are building a Spark distribution from your working copy and enabling 
> Python distribution as part of that you'll end up with this folder present 
> and Git will always warn you that it has untracked changes as a result.  
> Since this directory contains transient build artifacts this should be 
> ignored.






[jira] [Assigned] (SPARK-28617) Fix misplacement when comment is at the end of the query

2019-08-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28617:
-

Assignee: Yuming Wang

> Fix misplacement when comment is at the end of the query
> 
>
> Key: SPARK-28617
> URL: https://issues.apache.org/jira/browse/SPARK-28617
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-28617) Fix misplacement when comment is at the end of the query

2019-08-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28617.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25357
[https://github.com/apache/spark/pull/25357]

> Fix misplacement when comment is at the end of the query
> 
>
> Key: SPARK-28617
> URL: https://issues.apache.org/jira/browse/SPARK-28617
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs

2019-08-07 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28331.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved with [https://github.com/apache/spark/pull/25348]

> Catalogs.load always throws CatalogNotFoundException on loading built-in 
> catalogs
> -
>
> Key: SPARK-28331
> URL: https://issues.apache.org/jira/browse/SPARK-28331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In `Catalogs.load`, the `pluginClassName` in the following code 
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, 
> null);
> ```
> is always null for built-in catalogs, e.g. `spark.sql.catalog.session`, even 
> though there is a SQLConf entry for it.
> This is because of https://github.com/apache/spark/pull/18852: 
> SQLConf.conf.getConfString(key, null) always returns null.
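A hedged sketch of the `spark.sql.catalog.<name>` lookup pattern described above; the catalog name and plugin class below are hypothetical, and only the explicitly-set case is shown, to contrast with the built-in-default case the bug is about:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch of the "spark.sql.catalog.<name>" lookup pattern. "my_catalog"
// and com.example.MyCatalogPlugin are hypothetical. The reported bug concerns
// built-in catalogs whose value exists only as a registered default, which the
// quoted getConfString(key, null) call does not pick up.
object CatalogConfLookupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-conf-lookup")
      .getOrCreate()

    val name = "my_catalog"
    spark.conf.set(s"spark.sql.catalog.$name", "com.example.MyCatalogPlugin")

    // An explicitly-set entry resolves as expected; per the issue, the
    // registered-default-only case resolved to null instead.
    println(spark.conf.get(s"spark.sql.catalog.$name", null))

    spark.stop()
  }
}
{code}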






[jira] [Assigned] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs

2019-08-07 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-28331:
---

Assignee: Gengliang Wang

> Catalogs.load always throws CatalogNotFoundException on loading built-in 
> catalogs
> -
>
> Key: SPARK-28331
> URL: https://issues.apache.org/jira/browse/SPARK-28331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> In `Catalogs.load`, the `pluginClassName` in the following code 
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, 
> null);
> ```
> is always null for built-in catalogs, e.g. `spark.sql.catalog.session`, even 
> though there is a SQLConf entry for it.
> This is because of https://github.com/apache/spark/pull/18852: 
> SQLConf.conf.getConfString(key, null) always returns null.






[jira] [Resolved] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast

2019-08-07 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28470.
--
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 3.0.0

Resolved by [https://github.com/apache/spark/pull/25253]

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> 
>
> Key: SPARK-28470
> URL: https://issues.apache.org/jira/browse/SPARK-28470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> cast long to decimal or decimal to decimal can overflow, we should respect 
> the new config if overflow happens.
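A hedged illustration of the overflow case (the observed result depends on the Spark version and on the configuration named in this issue):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch of a cast that overflows the target precision: 12345678901 has
// 11 digits and cannot be represented as DECIMAL(10, 0). Whether this yields
// NULL or an error is what the spark.sql.decimalOperations.nullOnOverflow
// setting referenced in the issue is meant to control.
object DecimalCastOverflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("decimal-cast-overflow")
      .getOrCreate()

    spark.sql("SELECT CAST(12345678901 AS DECIMAL(10, 0)) AS v").show()

    spark.stop()
  }
}
{code}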






[jira] [Created] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors

2019-08-07 Thread nanav yorbiz (JIRA)
nanav yorbiz created SPARK-28652:


 Summary: spark.kubernetes.pyspark.pythonVersion is never passed to 
executors
 Key: SPARK-28652
 URL: https://issues.apache.org/jira/browse/SPARK-28652
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.3
Reporter: nanav yorbiz


I suppose this may not be a priority with Python 2 on its way out, but this 
setting is only ever sent to the driver and not to the executors, so no actual 
work can be performed when the versions don't match. That tends to be *always*, 
because the driver's default has changed from 2 to 3 while the executors use 
`python`, which still defaults to version 2.






[jira] [Assigned] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28651:


Assignee: Shixiong Zhu

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.
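What "changing the schema to nullable" amounts to can be sketched without the streaming machinery; the field names below are hypothetical:

{code:scala}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hedged sketch of the relaxation the batch path is described as applying: every
// field is marked nullable regardless of how it was declared. Field names are
// hypothetical; this is not the DataSource.scala code linked above.
object NullableSchemaSketch {
  def main(args: Array[String]): Unit = {
    val declared = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val relaxed = StructType(declared.fields.map(_.copy(nullable = true)))

    println(declared.treeString)  // id is non-nullable here...
    println(relaxed.treeString)   // ...and nullable here, matching the batch behavior
  }
}
{code}

A streaming source that keeps the declared (non-nullable) schema while the batch path relaxes it is exactly the mismatch the issue says can corrupt Parquet output.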






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Labels: release-notes  (was: )

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>  Labels: release-notes
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

It can cause corrupted parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

Tcaused corrupted parquet files due to the schema mismatch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

Tcaused corrupted parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was reported by Tomasz Magdanski. He found it caused corrupted 
parquet files due to the schema mismatch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> Tcaused corrupted parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Reporter: Tomasz  (was: Shixiong Zhu)

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was reported by Tomasz Magdanski. He found it caused corrupted 
> parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was reported by Tomasz Magdanski. He found it caused corrupted 
parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was reported by Tomasz Magdanski. He found it caused corrupted 
> parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was rpo


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was rpo

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was rpo






[jira] [Created] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28651:


 Summary: Streaming file source doesn't change the schema to 
nullable automatically
 Key: SPARK-28651
 URL: https://issues.apache.org/jira/browse/SPARK-28651
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.






[jira] [Updated] (SPARK-27492) GPU scheduling - High level user documentation

2019-08-07 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27492:
--
Description: 
For the SPIP - Accelerator-aware task scheduling for Spark, 
https://issues.apache.org/jira/browse/SPARK-24615 Add some high level user 
documentation about how this feature works together and point to things like 
the example discovery script, etc.

 
 - make sure to document the discovery script and what permissions are needed 
and any security implications
 - Document the standalone and local-cluster mode limitation of supporting only 
a single resource file or discovery script, so coordination is required for it 
to work correctly.

  was:
For the SPIP - Accelerator-aware task scheduling for Spark, 
https://issues.apache.org/jira/browse/SPARK-24615 Add some high level user 
documentation about how this feature works together and point to things like 
the example discovery script, etc.

 

- make sure to document the discovery script and what permissions are needed 
and any security implications


> GPU scheduling - High level user documentation
> --
>
> Key: SPARK-27492
> URL: https://issues.apache.org/jira/browse/SPARK-27492
> Project: Spark
>  Issue Type: Story
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> For the SPIP - Accelerator-aware task scheduling for Spark, 
> https://issues.apache.org/jira/browse/SPARK-24615 Add some high level user 
> documentation about how this feature works together and point to things like 
> the example discovery script, etc.
>  
>  - make sure to document the discovery script and what permissions are needed 
> and any security implications
>  - Document the standalone and local-cluster mode limitation of supporting 
> only a single resource file or discovery script, so coordination is required 
> for it to work correctly.






[jira] [Comment Edited] (SPARK-26152) Synchronize Worker Cleanup with Worker Shutdown

2019-08-07 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902477#comment-16902477
 ] 

Shixiong Zhu edited comment on SPARK-26152 at 8/7/19 9:07 PM:
--

[~ajithshetty] does your PR fix the flaky test? If I read it correctly, the 
failure is because of an OOM. The RejectedExecutionException is just a side 
effect, because the OOM triggers an unexpected executor shutdown. Should we try 
to figure out why the OOM happens? Maybe we have a memory leak.


was (Author: zsxwing):
[~ajithshetty] does your PR fix the flaky test? If I read correctly, the 
failure is because of OOM. RejectedExecutionException is just side effect 
because OOM triggers the executor shut down. Should we try to figure out why 
OOM happens? Maybe we have some memory leak.

> Synchronize Worker Cleanup with Worker Shutdown
> ---
>
> Key: SPARK-26152
> URL: https://issues.apache.org/jira/browse/SPARK-26152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Ajith S
>Priority: Critical
> Fix For: 2.4.4, 3.0.0
>
> Attachments: Screenshot from 2019-03-11 17-03-40.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627
>  (2018-11-16)
> {code}
> BroadcastSuite:
> - Using TorrentBroadcast locally
> - Accessing TorrentBroadcast variables from multiple threads
> - Accessing TorrentBroadcast variables in a local cluster (encryption = off)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@59428a1 rejected from 
> java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = 
> 1, active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
>   at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.Threa

[jira] [Commented] (SPARK-26152) Synchronize Worker Cleanup with Worker Shutdown

2019-08-07 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902477#comment-16902477
 ] 

Shixiong Zhu commented on SPARK-26152:
--

[~ajithshetty] does your PR fix the flaky test? If I read correctly, the 
failure is because of OOM. RejectedExecutionException is just a side effect, 
because the OOM triggers the executor shutdown. Should we try to figure out why 
the OOM happens? Maybe we have a memory leak.

> Synchronize Worker Cleanup with Worker Shutdown
> ---
>
> Key: SPARK-26152
> URL: https://issues.apache.org/jira/browse/SPARK-26152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Ajith S
>Priority: Critical
> Fix For: 2.4.4, 3.0.0
>
> Attachments: Screenshot from 2019-03-11 17-03-40.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627
>  (2018-11-16)
> {code}
> BroadcastSuite:
> - Using TorrentBroadcast locally
> - Accessing TorrentBroadcast variables from multiple threads
> - Accessing TorrentBroadcast variables in a local cluster (encryption = off)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@59428a1 rejected from 
> java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = 
> 1, active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
>   at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from 
> java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, 
> active threads = 1, queued tasks = 0, complete

[jira] [Created] (SPARK-28650) Fix the guarantee of ForeachWriter

2019-08-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28650:


 Summary: Fix the guarantee of ForeachWriter
 Key: SPARK-28650
 URL: https://issues.apache.org/jira/browse/SPARK-28650
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Right now ForeachWriter has the following guarantee:

{code}

If the streaming query is being executed in the micro-batch mode, then every 
partition
represented by a unique tuple (partitionId, epochId) is guaranteed to have the 
same data.
Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally 
commit data
and achieve exactly-once guarantees.

{code}

 

But this guarantee can easily be broken when a query is restarted and a batch is 
re-run (e.g., after upgrading Spark):
 * The source returns a different DataFrame with a different number of partitions 
(e.g., we start to not create empty partitions in Kafka Source V2).
 * A newly added optimization rule may change the number of partitions in the new 
run.
 * The file split size changes in the new run.

Since we cannot guarantee that the same (partitionId, epochId) always has the same 
data, we should update the documentation for "ForeachWriter".
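
For context, a minimal sketch (not from this ticket) of the kind of sink the 
current wording encourages, where (partitionId, epochId) is used as part of an 
idempotence key. The names {{KeyValueStore}} and {{DedupSink}} are hypothetical, 
and the in-memory map stands in for a real external store:

{code:scala}
import org.apache.spark.sql.ForeachWriter
import scala.collection.concurrent.TrieMap

// In-memory stand-in for an external store with idempotent writes (illustration only).
object KeyValueStore {
  private val data = TrieMap.empty[String, String]
  def putIfAbsent(key: String, value: String): Unit = { data.putIfAbsent(key, value); () }
}

// A sketch of a sink that leans on the documented (partitionId, epochId) guarantee.
class DedupSink extends ForeachWriter[String] {
  private var batchKey: String = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // The current docs suggest this pair uniquely identifies the partition's data;
    // this ticket points out that a restarted query can invalidate that assumption.
    batchKey = s"$partitionId-$epochId"
    true // process this partition
  }

  override def process(value: String): Unit = {
    // Deduplicate by (partitionId, epochId, record) -- only safe if the pair
    // really does identify the same data across reruns.
    KeyValueStore.putIfAbsent(s"$batchKey-$value", value)
  }

  override def close(errorOrNull: Throwable): Unit = ()
}
{code}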



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28583) Subqueries should not call `onUpdatePlan` in Adaptive Query Execution

2019-08-07 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-28583.
---
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 3.0.0

> Subqueries should not call `onUpdatePlan` in Adaptive Query Execution
> -
>
> Key: SPARK-28583
> URL: https://issues.apache.org/jira/browse/SPARK-28583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Subqueries do not have their own execution id, thus when calling 
> {{AdaptiveSparkPlanExec.onUpdatePlan}}, it will actually get the 
> {{QueryExecution}} instance of the main query, which is wasteful and 
> problematic. It could cause issues like stack overflow or dead locks in some 
> circumstances.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28649) Git Ignore does not ignore python/.eggs

2019-08-07 Thread Rob Vesse (JIRA)
Rob Vesse created SPARK-28649:
-

 Summary: Git Ignore does not ignore python/.eggs
 Key: SPARK-28649
 URL: https://issues.apache.org/jira/browse/SPARK-28649
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.3
Reporter: Rob Vesse


Currently the {{python/.eggs}} folder is not in the {{.gitignore}} file.  If 
you are building a Spark distribution from your working copy and enabling the 
Python distribution as part of that, you'll end up with this folder present, and 
Git will always warn you that there are untracked changes as a result.  Since this 
directory contains transient build artifacts, it should be ignored.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28648) Adds support to `groups` unit type in window clauses

2019-08-07 Thread Dylan Guedes (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-28648:
-
Summary: Adds support to `groups` unit type in window clauses  (was: Adds 
support to `groups` in window clauses)

> Adds support to `groups` unit type in window clauses
> 
>
> Key: SPARK-28648
> URL: https://issues.apache.org/jira/browse/SPARK-28648
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Spark currently supports the two most common window frame unit types: rows 
> and ranges. However, in PgSQL a new type was added: `groups`. 
> According to [this 
> source|https://blog.jooq.org/2018/07/05/postgresql-11s-support-for-sql-standard-groups-and-exclude-window-function-clauses/],
>  the difference is:
> """ROWS counts the exact number of rows in the frame.
> RANGE performs logical windowing where we don’t count the number of rows, but 
> look for a value offset.
> GROUPS counts all groups of tied rows within the window."""



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28648) Adds support to `groups` in window clauses

2019-08-07 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28648:


 Summary: Adds support to `groups` in window clauses
 Key: SPARK-28648
 URL: https://issues.apache.org/jira/browse/SPARK-28648
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dylan Guedes


Spark currently supports the two most common window frame unit types: rows 
and ranges. However, in PgSQL a new type was added: `groups`. 

According to [this 
source|https://blog.jooq.org/2018/07/05/postgresql-11s-support-for-sql-standard-groups-and-exclude-window-function-clauses/],
 the difference is:
"""ROWS counts the exact number of rows in the frame.
RANGE performs logical windowing where we don’t count the number of rows, but 
look for a value offset.
GROUPS counts all groups of tied rows within the window."""
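
For illustration only, a sketch of the PostgreSQL 11 GROUPS frame syntax this 
sub-task proposes to support, wrapped in a Spark SQL call; the table and column 
names come from the pgSQL regression tests used elsewhere in this umbrella, and 
Spark currently rejects the GROUPS frame unit:

{code:scala}
// A sketch only: Spark cannot parse the GROUPS frame unit yet; PostgreSQL 11+ can.
val groupsFrameQuery =
  """SELECT unique1,
    |       sum(unique1) OVER (ORDER BY unique1
    |                          GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS grouped_sum
    |FROM tenk1""".stripMargin

// spark.sql(groupsFrameQuery)  // fails today; this ticket proposes making it work
{code}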




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28647) Remove additional-metrics.js

2019-08-07 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-28647:
--

 Summary: Remove additional-metrics.js
 Key: SPARK-28647
 URL: https://issues.apache.org/jira/browse/SPARK-28647
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Kousuke Saruta


After stagepage.js was introduced, additional-metrics.js is no longer used, so 
let's remove it. It's just a cleanup, so I don't think it's worth filing, but if 
you think it's needed, please let me know.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28124) Faster S3 file source with SQS

2019-08-07 Thread Abhishek Dixit (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902382#comment-16902382
 ] 

Abhishek Dixit edited comment on SPARK-28124 at 8/7/19 6:16 PM:


As suggested by [~gsomogyi], [~ste...@apache.org], [~zsxwing] I've opened a 
[pull request|https://github.com/apache/bahir/pull/91] in BAHIR for this, so 
closing this ticket. 


was (Author: abhishekd0907):
As suggested by [~gsomogyi], [~ste...@apache.org], [~zsxwing] I've opened a 
[pull request |[https://github.com/apache/bahir/pull/91]] in BAHIR for this, so 
closing this ticket. 

> Faster S3 file source with SQS
> --
>
> Key: SPARK-28124
> URL: https://issues.apache.org/jira/browse/SPARK-28124
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Abhishek Dixit
>Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
>  The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to the S3 bucket without the need to list all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to the S3 bucket by reading 
> notifications from the SQS queue instead of listing files every microbatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28124) Faster S3 file source with SQS

2019-08-07 Thread Abhishek Dixit (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902382#comment-16902382
 ] 

Abhishek Dixit edited comment on SPARK-28124 at 8/7/19 6:15 PM:


As suggested by [~gsomogyi], [~ste...@apache.org], [~zsxwing] I've opened a 
[pull request |[https://github.com/apache/bahir/pull/91]] in BAHIR for this, so 
closing this ticket. 


was (Author: abhishekd0907):
As suggested by [~gsomogyi], [~ste...@apache.org], [~zsxwing] I've opened a 
[pull request|[https://github.com/apache/bahir/pull/91]] in BAHIR for this, so 
closing this ticket. 

> Faster S3 file source with SQS
> --
>
> Key: SPARK-28124
> URL: https://issues.apache.org/jira/browse/SPARK-28124
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Abhishek Dixit
>Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
>  The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to the S3 bucket without the need to list all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to the S3 bucket by reading 
> notifications from the SQS queue instead of listing files every microbatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28124) Faster S3 file source with SQS

2019-08-07 Thread Abhishek Dixit (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit resolved SPARK-28124.

Resolution: Feedback Received

As suggested by [~gsomogyi], [~ste...@apache.org], [~zsxwing] I've opened a 
[pull request|[https://github.com/apache/bahir/pull/91]] in BAHIR for this, so 
closing this ticket. 

> Faster S3 file source with SQS
> --
>
> Key: SPARK-28124
> URL: https://issues.apache.org/jira/browse/SPARK-28124
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Abhishek Dixit
>Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
>  The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to the S3 bucket without the need to list all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to the S3 bucket by reading 
> notifications from the SQS queue instead of listing files every microbatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function

2019-08-07 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28646:


 Summary: Allow usage of `count` only for parameterless aggregate 
function
 Key: SPARK-28646
 URL: https://issues.apache.org/jira/browse/SPARK-28646
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dylan Guedes


Currently, Spark allows `count` to be called with no arguments, even though it is 
not a parameterless aggregate function. For example, the following query actually works:
{code:sql}SELECT count() OVER () FROM tenk1;{code}
In PgSQL, on the other hand, the following error is thrown:
{code:sql}ERROR:  count(*) must be used to call a parameterless aggregate 
function{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28645) Throw an error on window redefinition

2019-08-07 Thread Dylan Guedes (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-28645:
-
Summary: Throw an error on window redefinition  (was: Block redefinition of 
window)

> Throw an error on window redefinition
> -
>
> Key: SPARK-28645
> URL: https://issues.apache.org/jira/browse/SPARK-28645
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently in Spark one could redefine a window. For instance:
> {code:sql}select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w 
> AS (ORDER BY unique1);{code}
> The window `w` is defined twice. In PgSQL, on the other hand, an error is 
> thrown:
> {code:sql}ERROR:  window "w" is already defined{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28645) Block redefinition of window

2019-08-07 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28645:


 Summary: Block redefinition of window
 Key: SPARK-28645
 URL: https://issues.apache.org/jira/browse/SPARK-28645
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dylan Guedes


Currently in Spark one could redefine a window. For instance:
{code:sql}select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w 
AS (ORDER BY unique1);{code}
The window `w` is defined twice. In PgSQL, on the other hand, an error is 
thrown:
{code:sql}ERROR:  window "w" is already defined{code}

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-07 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902238#comment-16902238
 ] 

Marcelo Vanzin commented on SPARK-28634:


Ah. If you use {{--principal}} and {{--keytab}} this works.

The config names have changed in master and you're using the deprecated ones; the 
YARN client code removes them from the config in client mode, but only the new 
names:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L769

For proper backwards compatibility it needs to remove the old names too. (Or 
make a change in the AM instead to ignore the keytab when running in client 
mode, which avoids the above hack.)
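
For reference, a sketch (not from this thread) of the equivalent invocation using 
the command-line flags instead of the deprecated configs, reusing the keytab path 
and principal from the report above:

{noformat}
bin/spark-sql --master yarn \
  --principal user-...@prod.example.com \
  --keytab /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab
{noformat}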

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2466)
>   at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:948)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql

[jira] [Comment Edited] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-07 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902238#comment-16902238
 ] 

Marcelo Vanzin edited comment on SPARK-28634 at 8/7/19 4:48 PM:


Ah. If you use {{\-\-principal}} and {{\-\-keytab}} this works.

The config names have changed in master and you're using the deprecated ones; the 
YARN client code removes them from the config in client mode, but only the new 
names:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L769

For proper backwards compatibility it needs to remove the old names too. (Or 
make a change in the AM instead to ignore the keytab when running in client 
mode, which avoids the above hack.)


was (Author: vanzin):
Ah. If you use {{--principal}} and {{--keytab}} this works.

The config name has changed in master and you're using the deprecated ones; the 
YARN client code removes them from the config in client mode, but only the new 
names:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L769

For proper backwards compatibility it needs to remove the old names too. (Or 
make a change in the AM instead to ignore the keytab when running in client 
mode, which avoids the above hack.)

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnC

[jira] [Updated] (SPARK-28644) Port HIVE-10646: ColumnValue does not handle NULL_TYPE

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28644:

Description: 
Port HIVE-10646 to fix the issue that Hive 0.12's beeline cannot handle NULL_TYPE:
{code:sql}
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:1> select null;
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
{code}

Server log:
{noformat}
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of 
message.
java.lang.NullPointerException
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13192)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13156)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.write(TCLIService.java:13107)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:819)
{noformat}



  was:
Port HIVE-10646 to fix Hive 0.12's beeline can not select null:
{code:sql}
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:1> select null;
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
{code}

Server log:
{noformat}
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of 
message.
java.lang.NullPointerException
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13

[jira] [Updated] (SPARK-28644) Port HIVE-10646: ColumnValue does not handle NULL_TYPE

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28644:

Description: 
Port HIVE-10646 to fix the issue that Hive 0.12's JDBC client cannot handle NULL_TYPE:
{code:sql}
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:1> select null;
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
{code}

Server log:
{noformat}
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of 
message.
java.lang.NullPointerException
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13192)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13156)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.write(TCLIService.java:13107)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:819)
{noformat}



  was:
Port HIVE-10646 to fix Hive 0.12's beeline can not handle NULL_TYPE:
{code:sql}
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:1> select null;
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
{code}

Server log:
{noformat}
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of 
message.
java.lang.NullPointerException
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIServic

[jira] [Created] (SPARK-28644) Port HIVE-10646: ColumnValue does not handle NULL_TYPE

2019-08-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28644:
---

 Summary: Port HIVE-10646: ColumnValue does not handle NULL_TYPE
 Key: SPARK-28644
 URL: https://issues.apache.org/jira/browse/SPARK-28644
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Port HIVE-10646 to fix the issue that Hive 0.12's beeline cannot select null:
{code:sql}
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:1> select null;
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
{code}

Server log:
{noformat}
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of 
message.
java.lang.NullPointerException
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at 
org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at 
org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at 
org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13192)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13156)
at 
org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.write(TCLIService.java:13107)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:819)
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28472) Add a test for testing different protocol versions

2019-08-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28472.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Add a test for testing different protocol versions
> --
>
> Key: SPARK-28472
> URL: https://issues.apache.org/jira/browse/SPARK-28472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28643) Use Java 8 time API in Dataset.show

2019-08-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-28643:
--

 Summary: Use Java 8 time API in Dataset.show
 Key: SPARK-28643
 URL: https://issues.apache.org/jira/browse/SPARK-28643
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, Dataset.show collects DATE and TIMESTAMP columns by converting them 
to java.sql.Date and java.sql.Timestamp, which can cause problems when 
converting those Java types to strings. For example: 
https://github.com/apache/spark/pull/25336#discussion_r310110759
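
As a rough illustration (not Spark code), the difference between rendering the same 
instant through the legacy java.sql.Timestamp (hybrid Julian/Gregorian calendar, JVM 
default time zone) and through the Java 8 time API (proleptic Gregorian, explicit 
zone); the chosen date and formatter pattern are arbitrary:

{code:scala}
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// An instant near the Julian-to-Gregorian calendar switch, where the two
// rendering paths disagree the most.
val instant = Instant.parse("1582-10-05T00:00:00Z")

// Legacy path: java.sql.Timestamp renders via the hybrid Julian+Gregorian
// calendar in the JVM default time zone.
val legacyString = new Timestamp(instant.toEpochMilli).toString

// Java 8 time API path: proleptic Gregorian calendar, explicit time zone.
val java8String = DateTimeFormatter
  .ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneId.of("UTC"))
  .format(instant)

println(s"legacy: $legacyString, java.time: $java8String")
{code}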



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-08-07 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Description: 
*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.

Objectives:
 # Allow users to specify task and executor resource requirements at the stage 
level. 
 # Spark will use the stage level requirements to acquire the necessary 
resources/executors and schedule tasks based on the per stage requirements.

Many times users have different resource requirements for different stages of 
their application so they want to be able to configure resources at the stage 
level. For instance, you have a single job that has 2 stages. The first stage 
does some  ETL which requires a lot of tasks, each with a small amount of 
memory and 1 core each. Then you have a second stage where you feed that ETL 
data into an ML algorithm. The second stage only requires a few executors but 
each executor needs a lot of memory, GPUs, and many cores.  This feature allows 
the user to specify the task and executor resource requirements for the ETL 
Stage and then change them for the ML stage of the job.  

Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
extra Resources (GPU/FPGA/etc). It has the potential to allow for other things 
like limiting the number of tasks per stage, specifying other parameters for 
things like shuffle, etc. Initially I would propose we only support resources 
as they are now. So Task resources would be cpu and other resources (GPU, 
FPGA), that way we aren't adding in extra scheduling things at this point.  
Executor resources would be cpu, memory, and extra resources(GPU,FPGA, etc). 
Changing the executor resources will rely on dynamic allocation being enabled.

Main use cases:
 # ML use case where user does ETL and feeds it into an ML algorithm where it’s 
using the RDD API. This should work with barrier scheduling as well once it 
supports dynamic allocation.
 # Spark internal use by catalyst. Catalyst could control the stage level 
resources as it finds the need to change it between stages for different 
optimizations. For instance, with the new columnar plugin to the query planner 
we can insert stages into the plan that would change running something on the 
CPU in row format to running it on the GPU in columnar format. This API would 
allow the planner to make sure the stages that run on the GPU get the 
corresponding GPU resources it needs to run. Another possible use case for 
catalyst is that it would allow catalyst to add in more optimizations to where 
the user doesn’t need to configure container sizes at all. If the 
optimizer/planner can handle that for the user, everyone wins.

This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I think 
the DataSet API will require more changes because it specifically hides the RDD 
from the users via the plans and catalyst can optimize the plan and insert 
things into the plan. The only way I’ve found to make this work with the 
Dataset API would be modifying all the plans to be able to get the resource 
requirements down into where it creates the RDDs, which I believe would be a 
lot of change.  If other people know better options, it would be great to hear 
them.

*Q2.* What problem is this proposal NOT designed to solve?

The initial implementation is not going to add Dataset APIs.

We are starting with allowing users to specify a specific set of task/executor 
resources and plan to design it to be extendable, but the first implementation 
will not support changing generic SparkConf configs and only specific limited 
resources.

This initial version will have a programmatic API for specifying the resource 
requirements per stage; we can add the ability to have profiles in the 
configs later if it's useful.
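
To make that concrete, a purely hypothetical sketch of what such a per-stage 
programmatic API could look like. ResourceProfile and .require are names from the 
design discussion; ResourceRequest, requireTask, and the withResources hook shown 
in the comments are illustrative and not an existing Spark API:

{code:scala}
// Hypothetical API sketch only -- none of these classes exist in Spark as-is.
final case class ResourceRequest(resourceName: String, amount: Int)

final class ResourceProfile {
  private var executorReqs = Map.empty[String, ResourceRequest]
  private var taskReqs     = Map.empty[String, ResourceRequest]

  // Hard ("must have") executor requirements; a .prefer variant could be added later as a hint.
  def require(r: ResourceRequest): ResourceProfile = {
    executorReqs += (r.resourceName -> r); this
  }

  // Hard task-level requirements (cpu, gpu, fpga, ...).
  def requireTask(r: ResourceRequest): ResourceProfile = {
    taskReqs += (r.resourceName -> r); this
  }
}

// Illustrative usage, assuming some RDD.withResources(profile) hook existed:
//   val etlProfile = new ResourceProfile()
//     .require(ResourceRequest("memory", 2))   // e.g. small 2g executors for the ETL stage
//     .requireTask(ResourceRequest("cpu", 1))
//   val mlProfile = new ResourceProfile()
//     .require(ResourceRequest("gpu", 1))      // large GPU executors for the ML stage
//     .requireTask(ResourceRequest("gpu", 1))
//   etlRdd.withResources(etlProfile)           // stage 1: many small tasks
//   mlRdd.withResources(mlProfile)             // stage 2: few GPU-heavy tasks
{code}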

*Q3.* How is it done today, and what are the limits of current practice?

Currently this is either done by having multiple spark jobs or requesting 
containers with the max resources needed for any part of the job.  To do this 
today, you can break it into separate jobs where each job requests the 
corresponding resources needed, but then you have to write the data out 
somewhere and then read it back in between jobs.  This is going to take longer, 
as well as require job coordination between those jobs to make sure everything 
works smoothly. Another option would be to request executors with your largest 
need up front and potentially waste those resources when they aren't being 
used, which in turn wastes money. For instance, for an ML application where it 
does ETL first, many times people request containers with GPUs and the GPUs sit 
idle while the ETL is happening. This is wasting those GPU resources and in 
turn money because those GPUs could have been used by other applications until 
they were really needed.  

Note for the catalyst internal 

[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-08-07 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Description: 
*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.

Objectives:
 # Allow users to specify task and executor resource requirements at the stage 
level. 
 # Spark will use the stage level requirements to acquire the necessary 
resources/executors and schedule tasks based on the per stage requirements.

Many times users have different resource requirements for different stages of 
their application so they want to be able to configure resources at the stage 
level. For instance, you have a single job that has 2 stages. The first stage 
does some  ETL which requires a lot of tasks, each with a small amount of 
memory and 1 core each. Then you have a second stage where you feed that ETL 
data into an ML algorithm. The second stage only requires a few executors but 
each executor needs a lot of memory, GPUs, and many cores.  This feature allows 
the user to specify the task and executor resource requirements for the ETL 
Stage and then change them for the ML stage of the job.

Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
extra Resources (GPU/FPGA/etc). It has the potential to allow for other things 
like limiting the number of tasks per stage, specifying other parameters for 
things like shuffle, etc. Initially I would propose we only support resources 
as they are now. So Task resources would be cpu and other resources (GPU, 
FPGA), that way we aren't adding in extra scheduling things at this point.  
Executor resources would be cpu, memory, and extra resources(GPU,FPGA, etc). 
Changing the executor resources will rely on dynamic allocation being enabled.

Main use cases:
 # ML use case where user does ETL and feeds it into an ML algorithm where it’s 
using the RDD API. This should work with barrier scheduling as well once it 
supports dynamic allocation.
 # Spark internal use by catalyst. Catalyst could control the stage level 
resources as it finds the need to change it between stages for different 
optimizations. For instance, with the new columnar plugin to the query planner 
we can insert stages into the plan that would change running something on the 
CPU in row format to running it on the GPU in columnar format. This API would 
allow the planner to make sure the stages that run on the GPU get the 
corresponding GPU resources it needs to run. Another possible use case for 
catalyst is that it would allow catalyst to add in more optimizations to where 
the user doesn’t need to configure container sizes at all. If the 
optimizer/planner can handle that for the user, everyone wins.

This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I think 
the DataSet API will require more changes because it specifically hides the RDD 
from the users via the plans and catalyst can optimize the plan and insert 
things into the plan. The only way I’ve found to make this work with the 
Dataset API would be modifying all the plans to be able to get the resource 
requirements down into where it creates the RDDs, which I believe would be a 
lot of change.  If other people know better options, it would be great to hear 
them.

*Q2.* What problem is this proposal NOT designed to solve?

The initial implementation is not going to add Dataset APIs.

We are starting with allowing users to specify a specific set of task/executor 
resources and plan to design it to be extendable, but the first implementation 
will not support changing generic SparkConf configs and only specific limited 
resources.

This initial version will have a programmatic API for specifying the resource 
requirements per stage; we can add the ability to have profiles in the 
configs later if it's useful.

*Q3.* How is it done today, and what are the limits of current practice?

Currently this is either done by having multiple spark jobs or requesting 
containers with the max resources needed for any part of the job.  To do this 
today, you can break it into separate jobs where each job requests the 
corresponding resources needed, but then you have to write the data out 
somewhere and then read it back in between jobs.  This is going to take longer, 
as well as require job coordination between those jobs to make sure everything 
works smoothly. Another option would be to request executors with your largest 
need up front and potentially waste those resources when they aren't being 
used, which in turn wastes money. For instance, for an ML application where it 
does ETL first, many times people request containers with GPUs and the GPUs sit 
idle while the ETL is happening. This is wasting those GPU resources and in 
turn money because those GPUs could have been used by other applications until 
they were really needed.  

Note for the catalyst internal us

[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-08-07 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902076#comment-16902076
 ] 

Thomas Graves commented on SPARK-27495:
---

Thanks for the comments.
 # I was trying to leave the API open to allow both. Not sure whether you had a 
chance to look at the design/api doc, but on the ResourceProfile I have a 
.require function, which would be a must have, and I mentioned in there that we 
could add a .prefer function that would be more of a hint or nice to have. 
Also, I should have clarified (and I'll update the description) that in this 
initial implementation I only intend to support scheduling on the same things 
we do now. So for tasks it would only be CPU and other resources (GPU, FPGA, 
etc.). For executor resources we would support requesting cpu, memory (offheap, 
onheap, overhead, pyspark), and other resources. So I wouldn't add support for 
task memory per stage at this point; it would also take other changes to 
enforce task memory usage. But your point is valid if that gets added, so I 
think we need to be careful about what we allow for each.
 # I think you have brought up 2 issues here.
 ## One is how we do resource merging/conflict resolution. My proposal was to 
simply look at the RDDs within the stage and merge any resource profiles. There 
are multiple ways we could do the merge: one is to naively just use a max, but 
as you pointed out some operations might be better done as a sum, so I was 
hoping to make it configurable, with perhaps some defaults for different 
operations (see the sketch just after this list). I guess the question is 
whether we need that for the initial implementation or whether we just make 
sure to design for it. I think you are also proposing to use the parent RDDs 
(do you mean just parents within a stage, or always?), but I think that all 
depends on how we define it and what the user is going to expect. I might have 
a very large RDD that needs processing and shrinks down to a very small one. To 
process that large RDD I need a lot of resources, but once it's shrunk I don't 
need that many, so the child of that RDD wouldn't need the same resources as 
its parent. That is why I was proposing to set the resources specifically for 
that RDD. We need to handle RDDs that have a shuffle, like groupBy, to make 
sure the profile is carried over.
 ## The other things you brought up are, I think, more about the task 
resources, since you are saying per partition, correct? There are definitely 
still issues even within the same RDD where some partitions may be larger than 
others. I'm not really trying to solve that problem here. This feature may open 
it up so that Spark can do more intelligent things there, but in this initial 
round users would still have to configure the resources for the worst case. I'm 
essentially leaving the configs the same as they are now. Users are used to 
defining the executor resources based on the worst case any task within the 
worst stage will consume; this just allows them to go one step further and 
control it per stage. It also hopefully opens it up to do more of that 
configuration for them in the future. For now, the idea is users would only 
configure the resources for the stages they want; all other stages would 
default back to the global configs. I'll clarify this in the SPIP as well. We 
could change this, but I think it gets hard for the user to know the scoping.
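
As a concrete illustration of the merge question in item 2.1 above, here is a 
minimal editorial sketch (StageResources is a stand-in type, not a real Spark 
class) of merging two per-stage requests by max versus by sum:

{code:java}
// Two RDDs in the same stage each carry a resource request; the stage needs one.
case class StageResources(taskCpus: Int, executorMemoryMb: Long, gpusPerTask: Int)

def merge(a: StageResources, b: StageResources, useSum: Boolean): StageResources =
  if (useSum) {
    // e.g. two inputs being combined in one stage might argue for summing.
    StageResources(a.taskCpus + b.taskCpus,
                   a.executorMemoryMb + b.executorMemoryMb,
                   a.gpusPerTask + b.gpusPerTask)
  } else {
    // the naive default: take the max of each dimension.
    StageResources(math.max(a.taskCpus, b.taskCpus),
                   math.max(a.executorMemoryMb, b.executorMemoryMb),
                   math.max(a.gpusPerTask, b.gpusPerTask))
  }
{code}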

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then 

[jira] [Resolved] (SPARK-26667) Add `Scanning Input Table` to Performance Tuning Guide

2019-08-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26667.
---
Resolution: Won't Fix

> Add `Scanning Input Table` to Performance Tuning Guide
> --
>
> Key: SPARK-26667
> URL: https://issues.apache.org/jira/browse/SPARK-26667
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Minor
>
> We can use `CombineTextInputFormat` instead of `TextInputFormat` and set its
> configurations to increase the speed of reading a table.
> There's no need to add new Spark configurations (see
> [PR#23506|https://github.com/apache/spark/pull/23506]), so add this to the
> Performance Tuning guide instead.
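> A minimal sketch of the reading pattern this refers to, assuming Hadoop's
> CombineTextInputFormat and an active SparkSession named {{spark}}; the path
> and split size below are illustrative only:
> {code:java}
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
> 
> // Pack many small files into fewer, larger splits before scanning the table path.
> val hadoopConf = spark.sparkContext.hadoopConfiguration
> hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
> 
> val lines = spark.sparkContext
>   .newAPIHadoopFile(
>     "/warehouse/mydb.db/mytable",          // illustrative table location
>     classOf[CombineTextInputFormat],
>     classOf[LongWritable],
>     classOf[Text],
>     hadoopConf)
>   .map(_._2.toString)
> {code}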



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28595) explain should not trigger partition listing

2019-08-07 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28595.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25328
[https://github.com/apache/spark/pull/25328]

> explain should not trigger partition listing
> 
>
> Key: SPARK-28595
> URL: https://issues.apache.org/jira/browse/SPARK-28595
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28642) Hide credentials in show create table

2019-08-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28642:

Description: 
{code:sql}
spark-sql> show create table mysql_federated_sample;
CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
`DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
`SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
USING org.apache.spark.sql.jdbc
OPTIONS (
`url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
`driver` 'com.mysql.jdbc.Driver',
`dbtable` 'TBLS'
)
{code}

  was:
{code:java}
spark-sql> show create table mysql_federated_sample;
CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
`DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
`SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
USING org.apache.spark.sql.jdbc
OPTIONS (
`url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
`driver` 'com.mysql.jdbc.Driver',
`dbtable` 'TBLS'
)
{code}


> Hide credentials in show create table
> -
>
> Key: SPARK-28642
> URL: https://issues.apache.org/jira/browse/SPARK-28642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> spark-sql> show create table mysql_federated_sample;
> CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
> `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
> `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
> STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> `url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
> `driver` 'com.mysql.jdbc.Driver',
> `dbtable` 'TBLS'
> )
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28642) Hide credentials in show create table

2019-08-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28642:
---

 Summary: Hide credentials in show create table
 Key: SPARK-28642
 URL: https://issues.apache.org/jira/browse/SPARK-28642
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


{code:java}
spark-sql> show create table mysql_federated_sample;
CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
`DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
`SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
USING org.apache.spark.sql.jdbc
OPTIONS (
`url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
`driver` 'com.mysql.jdbc.Driver',
`dbtable` 'TBLS'
)
{code}
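
The password in the `url` option above is printed in plain text. A minimal 
editorial sketch of the kind of masking this asks for; the key list and regex 
below are illustrative, not Spark's actual redaction logic:

{code:java}
// Mask sensitive option values before SHOW CREATE TABLE prints them.
val sensitiveKeys = Set("password", "secret", "token")   // illustrative key list

def redactOptionValue(key: String, value: String): String =
  if (sensitiveKeys.exists(k => key.toLowerCase.contains(k))) "*********(redacted)"
  else value

// JDBC URLs can also embed credentials, so mask them inside the URL itself.
def redactUrl(url: String): String =
  url.replaceAll("(?i)(password)=[^&]+", "$1=*********(redacted)")

// redactUrl("jdbc:mysql://localhost/hive?user=root&password=mypasswd")
//   == "jdbc:mysql://localhost/hive?user=root&password=*********(redacted)"
{code}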



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28641) MicroBatchExecution committed offsets greater than available offsets

2019-08-07 Thread MariaCarrie (JIRA)
MariaCarrie created SPARK-28641:
---

 Summary: MicroBatchExecution committed offsets greater than 
available offsets
 Key: SPARK-28641
 URL: https://issues.apache.org/jira/browse/SPARK-28641
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.1
 Environment: HDP --> 3.0.0

Spark --> 2.3.1

Kafka --> 2.1.1
Reporter: MariaCarrie


I use structured streaming to consume Kafka data; the trigger type is the 
default and checkpointing is enabled. But looking at the log, I find that 
structured streaming goes back to data it had already processed. The 
application log is as follows:

 

19/07/31 15:25:50 INFO KafkaSource: GetBatch called with start = Some({"dop_dvi_formatted-send_pus":{"2":13978245,"4":13978260,"1":13978249,"3":13978233,"0":13978242}}), end = {"dop_dvi_formatted-send_pus":{"2":13978245,"4":9053058,"1":13978249,"3":13978233,"0":13978242}}
19/07/31 15:25:50 INFO KafkaSource: Partitions added: Map()
19/07/31 15:25:50 WARN KafkaSource: Partition dop_dvi_formatted-send_pus-4's offset was changed from 13978260 to 9053058, some data may have been missed. Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".

 

I see that when the latestOffsets are fetched, they are compared with the 
committedOffsets to decide whether there is new data:

 

{code:java}
private def dataAvailable: Boolean = {
  availableOffsets.exists {
    case (source, available) =>
      committedOffsets
        .get(source)
        .map(committed => committed != available)
        .getOrElse(true)
  }
}
{code}

 

I think something went wrong on the Kafka side, causing the fetchLatestOffsets 
method to return the earliestOffsets. However, that data had already been 
successfully processed and committed. Could the dataAvailable method check 
whether availableOffsets has already been committed, so that such a batch is no 
longer marked as newData?
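
A minimal editorial sketch of that idea, phrased over plain per-partition Kafka 
offsets rather than the real MicroBatchExecution types: a batch would only 
count as new data if at least one partition's available offset is strictly 
ahead of the committed one, so an earliest-offsets glitch (available < 
committed) would not trigger reprocessing.

{code:java}
def hasNewData(
    committed: Map[String, Map[Int, Long]],   // topic -> partition -> offset
    available: Map[String, Map[Int, Long]]): Boolean = {
  available.exists { case (topic, partitionOffsets) =>
    partitionOffsets.exists { case (partition, avail) =>
      committed.get(topic).flatMap(_.get(partition)) match {
        case Some(comm) => avail > comm   // only strictly newer offsets count
        case None       => true           // previously unseen partition: new data
      }
    }
  }
}
{code}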

 

I don't know whether my understanding is correct; if the query goes on 
processing from the earliestOffsets, the structured streaming job cannot keep 
up in a timely way. I'm glad to receive any suggestions!

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2019-08-07 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901908#comment-16901908
 ] 

angerszhu commented on SPARK-11248:
---

I have made a patch for this problem: 
[https://github.com/apache/spark/pull/25201] 

> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 2.1.1, 2.2.0
>Reporter: Trystan Leftwich
>Priority: Major
>
> While running Spark as a hivethrift-server via YARN, Spark will use the user
> running the thrift server, rather than the user connecting via JDBC, to check
> HDFS permissions.
> i.e.
> In HDFS the perms are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And i connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If i try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following in set in hive-site.xml so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> 
> This works correctly in hive.
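> An editorial sketch of the impersonation pattern the description expects,
> using Hadoop's proxy-user API (the user name and path are illustrative, and
> the proxy user must be allowed via hadoop.proxyuser.* settings in
> core-site.xml):
> {code:java}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> 
> // Perform the HDFS access as the connecting user, not as the server's login user.
> val proxyUgi = UserGroupInformation.createProxyUser(
>   "testuser", UserGroupInformation.getLoginUser)
> 
> val listing = proxyUgi.doAs(new PrivilegedExceptionAction[Array[FileStatus]] {
>   override def run(): Array[FileStatus] = {
>     val fs = FileSystem.get(new Configuration())
>     fs.listStatus(new Path("/user/testuser/table/testtable"))
>   }
> })
> {code}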



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2019-08-07 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901847#comment-16901847
 ] 

angerszhu edited comment on SPARK-21918 at 8/7/19 7:58 AM:
---

[https://github.com/apache/spark/pull/25201]

gentle ping [~toopt4] [~huLiu] [~junzhang] [~fengchaoge] [~gss2002] 
[~dapengsun] [~shridharama], I have made a patch for this problem, though the 
approach is a little crude.


was (Author: angerszhuuu):
[https://github.com/apache/spark/pull/25201]

gentle ping [~toopt4] [~huLiu] [~junzhang] [~fengchaoge], I have made a patch 
for this problem, though the approach is a little crude.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the spark thrift server and found that all the DDL statements are
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in
> HiveClientImpl:
> {code:java}
> private def client: Hive = {
>   if (clientLoader.cachedHive != null) {
>     clientLoader.cachedHive.asInstanceOf[Hive]
>   } else {
>     val c = Hive.get(conf)
>     clientLoader.cachedHive = c
>     c
>   }
> }
> {code}
> But in impersonation mode, we should only share the Hive object inside the
> thread, so that the metastore client in Hive is associated with the right
> user.
> We can pass the Hive object of the parent thread to the child thread when
> running the sql to fix it.
> I already have an initial patch for review and I'm glad to work on it if
> anyone could assign it to me.
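> An editorial sketch of that per-thread idea (not the actual patch in the PR
> above): cache the Hive handle in a ThreadLocal so it is never shared across
> threads and each session's metastore calls carry that session's user:
> {code:java}
> import org.apache.hadoop.hive.conf.HiveConf
> import org.apache.hadoop.hive.ql.metadata.Hive
> 
> class PerThreadHiveClient(conf: HiveConf) {
>   // Each thread lazily creates and keeps its own Hive object.
>   private val cachedHive = new ThreadLocal[Hive] {
>     override def initialValue(): Hive = Hive.get(conf)
>   }
>   def client: Hive = cachedHive.get()
> }
> {code}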



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2019-08-07 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901847#comment-16901847
 ] 

angerszhu edited comment on SPARK-21918 at 8/7/19 7:55 AM:
---

[https://github.com/apache/spark/pull/25201]

gentle ping [~toopt4] [~huLiu] [~junzhang] [~fengchaoge], I have made a patch 
for this problem, though the approach is a little crude.


was (Author: angerszhuuu):
I have made a patch for this problem: 
[https://github.com/apache/spark/pull/25201] 

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the spark thrift server and found that all the DDL statements are
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in
> HiveClientImpl:
> {code:java}
> private def client: Hive = {
>   if (clientLoader.cachedHive != null) {
>     clientLoader.cachedHive.asInstanceOf[Hive]
>   } else {
>     val c = Hive.get(conf)
>     clientLoader.cachedHive = c
>     c
>   }
> }
> {code}
> But in impersonation mode, we should only share the Hive object inside the
> thread, so that the metastore client in Hive is associated with the right
> user.
> We can pass the Hive object of the parent thread to the child thread when
> running the sql to fix it.
> I already have an initial patch for review and I'm glad to work on it if
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2019-08-07 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901846#comment-16901846
 ] 

angerszhu commented on SPARK-5159:
--

I have made a patch for this problem: 
[https://github.com/apache/spark/pull/25201] 

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
>Priority: Major
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2019-08-07 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901847#comment-16901847
 ] 

angerszhu commented on SPARK-21918:
---

I have made a patch for this problem: 
[https://github.com/apache/spark/pull/25201] 

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the spark thrift server and found that all the DDL statements are
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in
> HiveClientImpl:
> {code:java}
> private def client: Hive = {
>   if (clientLoader.cachedHive != null) {
>     clientLoader.cachedHive.asInstanceOf[Hive]
>   } else {
>     val c = Hive.get(conf)
>     clientLoader.cachedHive = c
>     c
>   }
> }
> {code}
> But in impersonation mode, we should only share the Hive object inside the
> thread, so that the metastore client in Hive is associated with the right
> user.
> We can pass the Hive object of the parent thread to the child thread when
> running the sql to fix it.
> I already have an initial patch for review and I'm glad to work on it if
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27916) SparkThriftServer memory leak when 'spark.sql.hive.thriftServer.singleSession' enabled

2019-08-07 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-27916.
---
Resolution: Invalid

> SparkThriftServer memory leak when 
> 'spark.sql.hive.thriftServer.singleSession' enabled
> --
>
> Key: SPARK-27916
> URL: https://issues.apache.org/jira/browse/SPARK-27916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When we use SparkThriftServer with the config
> spark.sql.hive.thriftServer.singleSession = true, each client session will
> create a SparkSession object, but this InheritableThreadLocal object will not
> be released, since calling SparkSession.sql() only calls the ThreadLocal's set
> method. This leaves the SparkSession object in the JVM, never cleaned up.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28169) Spark can’t push down partition predicate for OR expression

2019-08-07 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28169:
--
Summary: Spark can’t push down partition predicate for OR expression  (was: 
Spark can’t push down predicate for OR expression)

> Spark can’t push down partition predicate for OR expression
> ---
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Priority: Major
>  Labels: SQL
>
> Spark can't push down a partition filter condition inside an OR:
> For example, I have a table {color:#d04437}default.test{color} whose partition
> column is "{color:#d04437}dt{color}",
> and I use the query:
> {code:java}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in (1,2,3))
> {code}
> In this case, Spark resolves the OR condition as one expression, and since this
> expression references "id" (a non-partition column), it can't be pushed down.
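> An editorial workaround sketch for the example above (assuming a SparkSession
> named {{spark}}): adding the implied partition-only disjunction by hand keeps
> the query equivalent while giving the planner a conjunct that mentions only
> the partition column, so it can prune to the two partitions:
> {code:java}
> // Equivalent to the original predicate: the original OR already implies
> // dt IN (20190625, 20190626), so conjoining it does not change the result.
> spark.sql("""
>   SELECT * FROM default.test
>   WHERE dt IN (20190625, 20190626)
>     AND (dt = 20190625 OR (dt = 20190626 AND id IN (1, 2, 3)))
> """)
> {code}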



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org