[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-11-27 Thread Jacky Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701444#comment-16701444
 ] 

Jacky Li commented on SPARK-24630:
--

Besides the CREATE TABLE/STREAM syntax for creating the source and the sink, is there any 
syntax to manage the streaming job, such as starting and stopping it? If I understand 
correctly, the INSERT statement is currently proposed to kick off the Structured Streaming 
job. Since this streaming job is continuous, I am wondering whether there is a way to 
show/desc/stop it?
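
For context, outside of SQL this is done today through the StreamingQuery API; below is a 
minimal sketch (using existing Spark APIs, not the proposed SQL syntax) of what 
show/desc/stop would roughly correspond to:

{code:scala}
// Start a streaming query, list the active queries, and stop one by name.
// Uses only existing Structured Streaming APIs; the SQL syntax is still a proposal.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sqlstreaming-demo").getOrCreate()

val query = spark.readStream
  .format("rate")              // built-in test source
  .load()
  .writeStream
  .format("console")
  .queryName("demo_stream")    // the name a SHOW/STOP syntax could refer to
  .start()

// "SHOW STREAMS": list active streaming queries
spark.streams.active.foreach(q => println(s"${q.name} -> ${q.id}"))

// "STOP STREAM demo_stream": stop the query by name
spark.streams.active.find(_.name == "demo_stream").foreach(_.stop())
{code}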


> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2018-11-27 Thread Utkarsh Maheshwari (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701439#comment-16701439
 ] 

Utkarsh Maheshwari commented on SPARK-19700:


[~cgbaker], have you started working on it yet? Is there any way I can help? I 
would be glad to.

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continues to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
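>
> A purely hypothetical sketch of the kind of plugin interface described above (the trait 
> names and methods are illustrative assumptions, not an existing Spark API):
> {code:scala}
> // Hypothetical plugin surface only; none of these names exist in Spark today.
> trait ExternalSchedulerPlugin {
>   // True if this plugin handles the given master URL, e.g. "k8s://..." or "yarn".
>   def canCreate(masterUrl: String): Boolean
>
>   // Build the cluster-manager-specific backend for this application.
>   def createBackend(masterUrl: String, appConf: Map[String, String]): SchedulerBackendStub
> }
>
> // Stand-in for the common interface Spark core would talk to, regardless of cluster manager.
> trait SchedulerBackendStub {
>   def start(): Unit                          // provision executors (YARN containers, k8s pods, ...)
>   def stop(): Unit                           // tear down cluster-manager resources
>   def requestTotalExecutors(total: Int): Boolean
> }
> {code}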



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26159:
---

Assignee: Juliusz Sompolski

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26159.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23127
[https://github.com/apache/spark/pull/23127]

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701401#comment-16701401
 ] 

Wenchen Fan commented on SPARK-26155:
-

Can you send a PR to revert SPARK-21052 and post the benchmark result there? 
Then we can start a discussion on that PR and merge it if everyone is fine with 
it.
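
For anyone reproducing the comparison, a minimal timing-harness sketch (the path and 
database name are placeholders, and the attached q19.sql is assumed to contain a single 
query):

{code:scala}
// Time the attached q19.sql end to end on a given build.
import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("q19-benchmark").enableHiveSupport().getOrCreate()
spark.sql("USE tpcds_3tb")                       // placeholder database name

val q19 = Source.fromFile("/path/to/q19.sql").mkString
val start = System.nanoTime()
spark.sql(q19).collect()                         // force full execution
println(s"Q19 took ${(System.nanoTime() - start) / 1e9} s")
{code}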

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and found that the root cause is in community patch SPARK-21052, 
> which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23545) [Spark-Core] port opened by the SparkDriver is vulnerable for flooding attacks

2018-11-27 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta resolved SPARK-23545.
---
Resolution: Invalid

> [Spark-Core] port opened by the SparkDriver is vulnerable for flooding attacks
> --
>
> Key: SPARK-23545
> URL: https://issues.apache.org/jira/browse/SPARK-23545
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sandeep katta
>Priority: Major
>
> The port opened by the SparkDriver is vulnerable to flooding attacks.
> *Steps*:
> set spark.network.timeout=60s //can be any value
> Start the Thrift server in client mode; you can see in the logs below that the 
> Spark driver opens the port for the AM and executors to communicate.
> Logs:
> 2018-03-01 16:11:16,497 | INFO  | [main] | Successfully started service 
> *'sparkDriver'* on port *22643*. | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2018-03-01 16:11:17,265 | INFO  | [main] | Successfully started service 
> 'SparkUI' on port 22950. | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2018-03-01 16:11:44,640 | INFO  | [main] | Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 22663. | 
> org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> 2018-03-01 16:11:52,822 | INFO  | [Thread-56] | Starting 
> ThriftBinaryCLIService on port 22550 with 5...501 worker threads | 
> org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:111)
> Telnet to this port using the *telnet IP 22643* command and keep it idle. 
> After 60 seconds check the status: the connection is still established, but it 
> should have been terminated.
> *lsof command output along with the date*
>  
> host1:/var/ # date
>  Thu Mar 1 *16:12:55* CST 2018
>  host1:/var/ # lsof | grep 22643
>  java 66730 user1 292u IPv6 1482635919 0t0 TCP 
> host1:22643->*10.18.152.191:59297* (ESTABLISHED)
>  java 66730 user1 297u IPv6 1482374122 0t0 TCP 
> host1:22643->BLR118529:43894 (ESTABLISHED)
>  java 66730 user1 346u IPv6 1482314249 0t0 TCP host1:22643 (LISTEN)
>  host1:/var/ # date
>  Thu Mar 1 16:13:43 CST 2018
>  host1:/var/ # date
>  Thu Mar 1 *16:16:55* CST 2018
>  host1:/var/ # lsof | grep 22643
>  java 66730 user1 292u IPv6 1482635919 0t0 TCP 
> host1:22643->*10.18.152.191:59297* (ESTABLISHED)
>  java 66730 user1 297u IPv6 1482374122 0t0 TCP 
> host1:22643->BLR118529:43894 (ESTABLISHED)
>  java 66730 user1 346u IPv6 1482314249 0t0 TCP host1:22643 (LISTEN)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data

2018-11-27 Thread ABHISHEK KUMAR GUPTA (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA resolved SPARK-24176.
--
Resolution: Duplicate

Closed as a duplicate of SPARK-23425.

> The hdfs file path with wildcard can not be identified when loading data
> 
>
> Key: SPARK-24176
> URL: https://issues.apache.org/jira/browse/SPARK-24176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version:2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> # Launch spark-sql
>  # create table wild1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',' stored as textfile;
>  # Loaded data into the table as below; some cases failed and the behavior is not consistent:
>  # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success
> load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - 
> *Failed*
> load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - 
> Success
> Exception as below
> > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1;
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_database: one
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> *Error in query: LOAD DATA input path does not exist: 
> /user/testdemo1/user1/t??eddata61.txt;*
> spark-sql>
> The behavior is not consistent. It needs to be fixed for all combinations of 
> wildcard characters.
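>
> As a diagnostic aid, the same patterns can be resolved directly with the Hadoop glob API 
> that path resolution ultimately relies on; a small sketch (run with the cluster's Hadoop 
> configuration on the classpath, paths taken from the report above):
> {code:scala}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val fs = FileSystem.get(new Configuration())
> Seq(
>   "/user/testdemo1/user1/?ype*",
>   "/user/testdemo1/user1/t??eddata60.txt",
>   "/user/testdemo1/user1/?ypeddata60.txt"
> ).foreach { pattern =>
>   // globStatus returns null for a non-glob path that does not exist
>   val matches = Option(fs.globStatus(new Path(pattern))).map(_.length).getOrElse(0)
>   println(s"$pattern -> $matches match(es)")
> }
> {code}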



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26189:


Assignee: (was: Apache Spark)

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701383#comment-16701383
 ] 

Apache Spark commented on SPARK-26189:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23161

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26189:


Assignee: Apache Spark

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26196) Total tasks message in the stage is incorrect, when there are failed or killed tasks

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26196:
---
Description: 
Total tasks message in the stage page is incorrect when there are failed or 
killed tasks.

 

  was:
Total tasks in the stage page is incorrect when there are failed or killed 
tasks.

 


> Total tasks message in the stage is incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Priority: Major
>
> Total tasks message in the stage page is incorrect when there are failed or 
> killed tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26196) Total tasks message in the stage is incorrect, when there are failed or killed tasks

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26196:
---
Summary: Total tasks message in the stage is incorrect, when there are 
failed or killed tasks  (was: Total tasks message in the stage in incorrect, 
when there are failed or killed tasks)

> Total tasks message in the stage is incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Priority: Major
>
> Total tasks in the stage page is incorrect when there are failed or killed 
> tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-27 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701321#comment-16701321
 ] 

xuqianjin commented on SPARK-23410:
---

hi [~maxgekk] [~hyukjin.kwon] Thank you very much. Can I just pull the code 
from the latest master branch and open a PR?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or detect the 
> charset automatically, as before.
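>
> For reference, a minimal sketch of the explicit-charset workaround, assuming Spark 2.4+ 
> where the JSON source exposes an encoding option (spark is a SparkSession, e.g. in 
> spark-shell; the file name is taken from the attachment):
> {code:scala}
> // Read a UTF-16 JSON file with an explicit charset; multiLine avoids the need
> // to also specify a line separator for non-UTF-8 encodings.
> val df = spark.read
>   .option("encoding", "UTF-16")
>   .option("multiLine", "true")
>   .json("utf16WithBOM.json")
> df.show()
> {code}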



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701317#comment-16701317
 ] 

Ke Jia commented on SPARK-26155:


[~viirya] Thanks for your reply. 

> "Q19 analysis in Spark2.3 without L486 & 487.pdf" has Stage time and DAG in 
>Spark 2.1, but the document title is Spark 2.3. Which version Spark is used 
>for it?

My Spark version is Spark 2.3. The "Stage time and DAG in Spark 2.1" label was my 
mistake, and I have re-uploaded the document.

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and found that the root cause is in community patch SPARK-21052, 
> which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-11-27 Thread Jiayi Liao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liao updated SPARK-26182:
---
Description: 
Let's assume that we have a UDF called splitUDF which outputs a map.
 The SQL
{code:java}
select
g['a'], g['b']
from
   ( select splitUDF(x) as g from table) tbl
{code}
will be optimized to the same logical plan of
{code:java}
select splitUDF(x)['a'], splitUDF(x)['b'] from table
{code}
which means that the splitUDF is executed twice instead of once.

The optimization is from CollapseProject. 
 I'm not sure whether this is a bug or not. Please tell me if I was wrong about 
this.

  was:
Let's Assume that we have a udf called splitUDF which outputs a map data.
 The SQL
{code:java}
select
g['a'], g['b']
from
   ( select splitUDF(x) as g from table) tbl
{code}
will be optimized to the same logical plan of
{code:java}
select splitUDF(x)['a'], splitUDF(x)['b'] from table
{code}
which means that the splitUDF is executed twice instead of once.

The optimization is from CollapseProject. 
 I'm not sure whether this is a bug or not. Please tell me if I was wrong about 
this.


> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs a map.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.
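>
> A small self-contained sketch of one way to observe and avoid the duplication: marking 
> the UDF as non-deterministic keeps CollapseProject from inlining it twice (the UDF body 
> below is made up for illustration, and exact plan shapes can vary by Spark version):
> {code:scala}
> // In spark-shell; spark is the active SparkSession.
> import spark.implicits._
> import org.apache.spark.sql.functions.udf
>
> // Made-up splitUDF for illustration: returns a map with keys "a" and "b".
> val splitUDF = udf((x: String) => Map("a" -> x.take(1), "b" -> x.drop(1)))
> val df = Seq("hello", "spark").toDF("x")
>
> // With the default (deterministic) UDF, the optimized plan evaluates splitUDF twice.
> df.select(splitUDF($"x").as("g")).select($"g"("a"), $"g"("b")).explain(true)
>
> // Marking the UDF non-deterministic prevents CollapseProject from duplicating it.
> val splitOnce = splitUDF.asNondeterministic()
> df.select(splitOnce($"x").as("g")).select($"g"("a"), $"g"("b")).explain(true)
> {code}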



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26197) Spark master fails to detect driver process pause

2018-11-27 Thread Jialin LIu (JIRA)
Jialin LIu created SPARK-26197:
--

 Summary: Spark master fails to detect driver process pause
 Key: SPARK-26197
 URL: https://issues.apache.org/jira/browse/SPARK-26197
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Jialin LIu


I was using Spark 2.3.2 with a standalone cluster and submitted a job in cluster 
mode. After submitting the job, I deliberately paused the driver process 
(through the shell command "kill -stop (driver process id)") to see whether the 
master could detect this problem. The result shows that the driver never stops. 
All the executors try to talk back to the driver and give up after 10 minutes. 
The master detects the executor failures and tries to assign new executor 
processes to redo the job. Each new executor tries to create an RPC connection 
with the driver and fails after 2 minutes. The master endlessly spawns new 
executors without detecting the driver failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: (was: Q19 analysis in Spark2.3 without L486 & 487.pdf)

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and found that the root cause is in community patch SPARK-21052, 
> which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: Q19 analysis in Spark2.3 without L486&487.pdf

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and found that the root cause is in community patch SPARK-21052, 
> which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26196) Total tasks message in the stage in incorrect, when there are failed or killed tasks

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701294#comment-16701294
 ] 

Apache Spark commented on SPARK-26196:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23160

> Total tasks message in the stage in incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Priority: Major
>
> Total tasks in the stage page is incorrect when there are failed or killed 
> tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26196) Total tasks message in the stage in incorrect, when there are failed or killed tasks

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26196:


Assignee: Apache Spark

> Total tasks message in the stage in incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Major
>
> Total tasks in the stage page is incorrect when there are failed or killed 
> tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26196) Total tasks message in the stage in incorrect, when there are failed or killed tasks

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26196:


Assignee: (was: Apache Spark)

> Total tasks message in the stage in incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Priority: Major
>
> Total tasks in the stage page is incorrect when there are failed or killed 
> tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26195) Correct exception messages in some classes

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701279#comment-16701279
 ] 

Apache Spark commented on SPARK-26195:
--

User 'lcqzte10192193' has created a pull request for this issue:
https://github.com/apache/spark/pull/23154

> Correct exception messages in some classes
> --
>
> Key: SPARK-26195
> URL: https://issues.apache.org/jira/browse/SPARK-26195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: lichaoqun
>Priority: Minor
>
> The UnsupportedOperationException messages do not match the method names. This 
> PR corrects these messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26196) Total tasks message in the stage in incorrect, when there are failed or killed tasks

2018-11-27 Thread shahid (JIRA)
shahid created SPARK-26196:
--

 Summary: Total tasks message in the stage in incorrect, when there 
are failed or killed tasks
 Key: SPARK-26196
 URL: https://issues.apache.org/jira/browse/SPARK-26196
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: shahid


Total tasks in the stage page is incorrect when there are failed or killed 
tasks.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26196) Total tasks message in the stage in incorrect, when there are failed or killed tasks

2018-11-27 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701285#comment-16701285
 ] 

shahid commented on SPARK-26196:


I will raise a PR

> Total tasks message in the stage in incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Priority: Major
>
> Total tasks in the stage page is incorrect when there are failed or killed 
> tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26195) Correct exception messages in some classes

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701281#comment-16701281
 ] 

Apache Spark commented on SPARK-26195:
--

User 'lcqzte10192193' has created a pull request for this issue:
https://github.com/apache/spark/pull/23154

> Correct exception messages in some classes
> --
>
> Key: SPARK-26195
> URL: https://issues.apache.org/jira/browse/SPARK-26195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: lichaoqun
>Priority: Minor
>
> The UnsupportedOperationException messages do not match the method names. This 
> PR corrects these messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26195) Correct exception messages in some classes

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26195:


Assignee: (was: Apache Spark)

> Correct exception messages in some classes
> --
>
> Key: SPARK-26195
> URL: https://issues.apache.org/jira/browse/SPARK-26195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: lichaoqun
>Priority: Minor
>
> The UnsupportedOperationException messages do not match the method names. This 
> PR corrects these messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26195) Correct exception messages in some classes

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26195:


Assignee: Apache Spark

> Correct exception messages in some classes
> --
>
> Key: SPARK-26195
> URL: https://issues.apache.org/jira/browse/SPARK-26195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: lichaoqun
>Assignee: Apache Spark
>Priority: Minor
>
> The UnsupportedOperationException messages do not match the method names. This 
> PR corrects these messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26195) Correct exception messages in some classes

2018-11-27 Thread lichaoqun (JIRA)
lichaoqun created SPARK-26195:
-

 Summary: Correct exception messages in some classes
 Key: SPARK-26195
 URL: https://issues.apache.org/jira/browse/SPARK-26195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: lichaoqun


The UnsupportedOperationException messages do not match the method names. This 
PR corrects these messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-11-27 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701275#comment-16701275
 ] 

Yuanjian Li commented on SPARK-10816:
-

Thanks for the fix and benchmark by [~ivoson]; the fix commit has been merged into 
[https://github.com/apache/spark/pull/22583.] 

[~kabhwan] Is there any possibility of combining our proposals and fixing this 
issue together? I think the benchmark results are currently flat, and I hope we can 
solve this problem together.

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend

2018-11-27 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26194:
--

 Summary: Support automatic spark.authenticate secret in Kubernetes 
backend
 Key: SPARK-26194
 URL: https://issues.apache.org/jira/browse/SPARK-26194
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


Currently k8s inherits the default behavior for {{spark.authenticate}}, which 
is that the user must provide an auth secret.

k8s doesn't have that requirement and could instead generate its own unique 
per-app secret, and propagate it to executors.
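
A minimal sketch of what generating a unique per-app secret could look like (illustrative 
only; the propagation mechanism, e.g. a Kubernetes secret mounted into driver and executor 
pods, is an assumption rather than the eventual implementation):

{code:scala}
import java.security.SecureRandom
import java.util.Base64

// Generate a random 256-bit secret suitable for spark.authenticate.
def createAppSecret(bits: Int = 256): String = {
  val bytes = new Array[Byte](bits / 8)
  new SecureRandom().nextBytes(bytes)
  Base64.getUrlEncoder.withoutPadding().encodeToString(bytes)
}

// The k8s backend would create one of these per application and ship it to executors.
val secret = createAppSecret()
{code}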



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24219) Improve the docker build script to avoid copying everything in example

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24219.

Resolution: Duplicate

I fixed this as part of SPARK-26025.

> Improve the docker build script to avoid copying everything in example
> --
>
> Key: SPARK-24219
> URL: https://issues.apache.org/jira/browse/SPARK-24219
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The current docker build script copies everything under the examples folder into the 
> docker image if it is invoked from the dev path; this unnecessarily copies too many 
> files, such as temporary build files, into the docker image. This issue proposes to 
> improve the script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24383.

Resolution: Not A Problem

This has been working reliably for me. If your k8s server is not gc'ing old 
state, then it's probably an issue with your server.

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exits, the "*driver-svc" services created for the driver 
> are not cleaned up. This causes an accumulation of services in the k8s layer; at 
> some point no more services can be created.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24577) Spark submit fails with documentation example spark-pi

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24577.

Resolution: Duplicate

> Spark submit fails with documentation example spark-pi
> --
>
> Key: SPARK-24577
> URL: https://issues.apache.org/jira/browse/SPARK-24577
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Kuku1
>Priority: Major
>
> The Spark-submit example in the [K8s 
> documentation|http://spark.apache.org/docs/latest/running-on-kubernetes.html#cluster-mode]
>  fails for me.
> {code:java}
> .\spark-submit.cmd --master k8s://https://my-k8s:8443
> --conf spark.kubernetes.namespace=my-namespace --deploy-mode cluster --name 
> spark-pi --class org.apache.spark.examples.SparkPi
> --conf spark.executor.instances=5
> --conf spark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0
> --conf spark.kubernetes.driver.pod.name=spark-pi-driver 
> local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> {code}
> Error in the driver log:
> {code:java}
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + SPARK_K8S_CMD=driver
> + '[' -z driver ']'
> + shift 1
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_JAVA_OPTS
> + '[' -n 
> '/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
>  ']'
> + 
> SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> + '[' -n '' ']'
> + case "$SPARK_K8S_CMD" in
> + CMD=(${JAVA_HOME}/bin/java "${SPARK_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" 
> -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS)
> + exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java 
> -Dspark.kubernetes.namespace=my-namespace -Dspark.driver.port=7078 
> -Dspark.master=k8s://https://my-k8s:8443  
> -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
>  -Dspark.driver.blockManager.port=7079 
> -Dspark.app.id=spark-311b7351345240fd89d6d86eaabdff6f 
> -Dspark.kubernetes.driver.pod.name=spark-pi-driver 
> -Dspark.executor.instances=5 -Dspark.app.name=spark-pi 
> -Dspark.driver.host=spark-pi-ef6be7cac60a3f789f9714b2ebd1c68c-driver-svc.my-namespace.svc
>  -Dspark.submit.deployMode=cluster 
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-ef6be7cac60a3f789f9714b2ebd1c68c
>  -Dspark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0 -cp 
> ':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
>  -Xms1g -Xmx1g -Dspark.driver.bindAddress=172.101.1.40 
> org.apache.spark.examples.SparkPi
> Error: Could not find or load main class org.apache.spark.examples.SparkPi
> {code}
> I am also using spark-operator to run the example and this one works for me. 
> The spark-operator outputs its command to spark-submit:
>  
> {code:java}
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + SPARK_K8S_CMD=driver
> + '[' -z driver ']'
> + shift 1
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_JAVA_OPTS
> + '[' -n 
> /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
>  ']'
> + 
> SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> + '[' -n '' ']'
> + case "$SPARK_K8S_CMD" in
> + CMD=(${JAVA_HOME}/bin/java "${SPARK_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" 
> -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS)
> + exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java
> -Dspark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi-2557211557
> -Dspark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0
> -Dspark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi
> -Dspark.app.name=spark-pi
> -Dspark.executor.instances=7
> -Dspark.driver.blockManager.port=7079
> -Dspark.driver.cores=0.10
> -Dspark.kubernetes.driver.label.version=2.3.0
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-607e0943cf32319883cc3beb2e02be4f
> -Dspark.executor.memory=512m
> -Dspark.kubernetes.driver.label.

[jira] [Resolved] (SPARK-24600) Improve support for building different types of images in dockerfile

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24600.

Resolution: Duplicate

> Improve support for building different types of images in dockerfile
> 
>
> Key: SPARK-24600
> URL: https://issues.apache.org/jira/browse/SPARK-24600
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Our docker image build currently builds and pushes docker images for pyspark and 
> java/scala.
> We should be able to build/push either one of them. In the future, we'll have 
> this extended to SparkR, the shuffle service, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26096) k8s integration tests should run R tests

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26096.

Resolution: Duplicate

> k8s integration tests should run R tests
> 
>
> Key: SPARK-26096
> URL: https://issues.apache.org/jira/browse/SPARK-26096
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Noticed while debugging a completely separate thing.
> - the Jenkins job doesn't enable the SparkR profile
> - KubernetesSuite doesn't include the RTestsSuite trait
> Even if you fix those two, it seems the tests are broken:
> {noformat}
> [info] - Run SparkR on simple dataframe.R example *** FAILED *** (2 minutes, 
> 3 seconds)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308)
> [info]   at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307)
> [info]   at 
> org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
> [info]   at 
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.runSparkApplicationAndVerifyCompletion(KubernetesSuite.scala:274)
> [info]   at 
> org.apache.spark.deploy.k8s.integrationtest.RTestsSuite.$anonfun$$init$$1(RTestsSuite.scala:26)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26125) Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS

2018-11-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701200#comment-16701200
 ] 

Marcelo Vanzin commented on SPARK-26125:


I'm pretty sure this works with my patch for SPARK-25815, but since this is a 
different bug from that one, I will keep it separate (and close them together).

> Delegation Token seems not appropriately stored on secrets of 
> Kubernetes/Kerberized HDFS
> 
>
> Key: SPARK-26125
> URL: https://issues.apache.org/jira/browse/SPARK-26125
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Kei Kori
>Priority: Minor
> Attachments: spark-submit-stern.log
>
>
> I tried Kerberos authentication with the Kubernetes resource manager and an 
> external Hadoop cluster and KDC.
> I tested a build based on 
> [6c9c84f|https://github.com/apache/spark/commit/6c9c84ffb9c8d98ee2ece7ba4b010856591d383d]
>  (master + SPARK-23257).
> {code}
> $ bin/spark-submit \
>   --deploy-mode cluster \
>   --class org.apache.spark.examples.HdfsTest \
>   --master k8s://https://master01.node:6443 \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>   --conf spark.app.name=spark-hdfs \
>   --conf spark.executer.instances=1 \
>   --conf 
> spark.kubernetes.container.image=docker-registry/kkori/spark:6c9c84f \
>   --conf spark.kubernetes.kerberos.enabled=true \
>   --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \
>   --conf spark.kubernetes.kerberos.keytab=/tmp/test.keytab \
>   --conf 
> spark.kubernetes.kerberos.principal=t...@external.kerberos.realm.com \
>   --conf spark.kubernetes.hadoop.configMapName=hadoop-conf \
>   local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar
> {code}
> I successfully submitted to the Kubernetes RM, and Kubernetes spawned the 
> spark-driver and executors,
> but the Hadoop delegation token seems to be stored incorrectly in the Kubernetes 
> secret, since it contains only a header, as shown below:
> {code}
> $ kubectl get secrets spark-hdfs-1542613661459-delegation-tokens -o 
> jsonpath='{.data.hadoop-tokens}' | {base64 -d | cat -A; echo;}
> HDTS^@^@^@
> {code}
> The result of "kubectl get secrets" should look like the following (I masked the 
> actual result):
> {code}
> HDTS^@^ha-hdfs:test^@^_t...@external.kerberos.realm.com^@^@
> {code}
> As a result, spark-driver threw GSSException for each access of HDFS.
> Full logs(submit, driver, executor) are attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25744) Allow kubernetes integration tests to be run against a real cluster.

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25744.

Resolution: Duplicate

> Allow kubernetes integration tests to be run against a real cluster.
> 
>
> Key: SPARK-25744
> URL: https://issues.apache.org/jira/browse/SPARK-25744
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Minor
>
> Currently, the tests can only run against a minikube cluster. Testing against a 
> real cluster gives more flexibility in writing tests with a larger number of 
> executors and more resources.
> It will also be helpful if minikube is unavailable for testing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26064) Unable to fetch jar from remote repo while running spark-submit on kubernetes

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26064.

Resolution: Invalid

I'm closing this for the time being. If you have a question please use the 
mailing lists. If you're reporting an issue, please provide more information 
(like the actual error).

> Unable to fetch jar from remote repo while running spark-submit on kubernetes
> -
>
> Key: SPARK-26064
> URL: https://issues.apache.org/jira/browse/SPARK-26064
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes
>Affects Versions: 2.3.2
>Reporter: Bala Bharath Reddy Resapu
>Priority: Major
>
> I am trying to run Spark on Kubernetes with a docker image. My requirement is 
> to download the jar from an external repo while running spark-submit. I am 
> able to download the jar using wget in the container, but it doesn't work when 
> I pass the jar URL to the spark-submit command. I am not packaging the jar with the 
> docker image. It works fine when I put the jar file inside the docker image.
>  
> ./bin/spark-submit \
> --master k8s://[https://ip:port|https://ipport/] \
> --deploy-mode cluster \
> --name test3 \
> --class hello \
> --conf spark.kubernetes.container.image.pullSecrets=abcd \
> --conf spark.kubernetes.container.image=spark:h2.0 \
> [https://devops.com/artifactory/local/testing/testing_2.11/h|https://bala.bharath.reddy.resapu%40ibm.com:akcp5bcbktykg2ti28sju4gtebsqwkg2mqkaf9w6g5rdbo3iwrwx7qb1m5dokgd54hdru2...@na.artifactory.swg-devops.com/artifactory/txo-cedp-garage-artifacts-sbt-local/testing/testing_2.11/arithmetic.jar]ello.jar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26193) Implement shuffle write metrics in SQL

2018-11-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26193:
---

 Summary: Implement shuffle write metrics in SQL
 Key: SPARK-26193
 URL: https://issues.apache.org/jira/browse/SPARK-26193
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Yuanjian Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26191) Control number of truncated fields

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26191:


Assignee: Apache Spark

> Control number of truncated fields
> --
>
> Key: SPARK-26191
> URL: https://issues.apache.org/jira/browse/SPARK-26191
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the threshold for truncated fields converted to string can be 
> controlled via global SQL config. Need to add the maxFields parameter to all 
> functions/methods that potentially could produce truncated string from a 
> sequence of fields.
> One of use cases is toFile. This method aims to output not truncated plans. 
> For now users has to set global config to flush whole plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26190.

Resolution: Won't Fix

I'm closing this for now until I see a better use case. It seems you can easily 
do this if you want to, without needing any changes in Spark.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> The improvement request is for improvement in the SparkLauncher class which 
> is responsible to execute builtin spark-submit script using Java API.
> In my use case, there is a custom wrapper script which help in integrating 
> the security features while submitting the spark job using builtin 
> spark-submit.
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.
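To make the proposal quoted above concrete, the sketch below shows one possible 
shape of such an optional parameter. The class and the setSubmitterScript method 
are hypothetical; nothing like this exists in SparkLauncher today:
{code:java}
// Hypothetical sketch only: it illustrates the proposed optional parameter
// and its fallback behaviour, not an actual Spark API.
import java.io.File;

public class ConfigurableSubmitter {
  private final String sparkHome;
  private String submitterScript;   // e.g. /opt/company/bin/secure-spark-submit

  public ConfigurableSubmitter(String sparkHome) {
    this.sparkHome = sparkHome;
  }

  // Proposed optional parameter: an absolute path to a custom wrapper script.
  public ConfigurableSubmitter setSubmitterScript(String scriptPath) {
    this.submitterScript = scriptPath;
    return this;
  }

  // Falls back to SPARK_HOME/bin/spark-submit(.cmd) when no custom script is set.
  public String resolveSubmitter(boolean isWindows) {
    if (submitterScript != null) {
      return submitterScript;
    }
    String script = isWindows ? "spark-submit.cmd" : "spark-submit";
    return String.join(File.separator, sparkHome, "bin", script);
  }
}
{code}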



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701140#comment-16701140
 ] 

Marcelo Vanzin commented on SPARK-26190:


bq. Can you give me one good reason that why you must have the script name 
hard-coded in the SparkLauncher?

Because that is the Spark public interface. You run Spark using spark-submit, 
and you customize spark-submit using spark-env.sh. If that does not work for 
you, you'll need a little more justification than "I want to run my own script".

You have a bunch of options, including patching spark-submit when deploying in 
your environment, that don't require changes in Spark at all.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> The improvement request is for improvement in the SparkLauncher class which 
> is responsible to execute builtin spark-submit script using Java API.
> In my use case, there is a custom wrapper script which help in integrating 
> the security features while submitting the spark job using builtin 
> spark-submit.
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26191) Control number of truncated fields

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701139#comment-16701139
 ] 

Apache Spark commented on SPARK-26191:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23159

> Control number of truncated fields
> --
>
> Key: SPARK-26191
> URL: https://issues.apache.org/jira/browse/SPARK-26191
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the threshold for truncated fields converted to string can be 
> controlled via global SQL config. Need to add the maxFields parameter to all 
> functions/methods that potentially could produce truncated string from a 
> sequence of fields.
> One of use cases is toFile. This method aims to output not truncated plans. 
> For now users has to set global config to flush whole plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26191) Control number of truncated fields

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26191:


Assignee: (was: Apache Spark)

> Control number of truncated fields
> --
>
> Key: SPARK-26191
> URL: https://issues.apache.org/jira/browse/SPARK-26191
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the threshold for truncated fields converted to string can be 
> controlled via global SQL config. Need to add the maxFields parameter to all 
> functions/methods that potentially could produce truncated string from a 
> sequence of fields.
> One of use cases is toFile. This method aims to output not truncated plans. 
> For now users has to set global config to flush whole plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2018-11-27 Thread Martin Loncaric (JIRA)
Martin Loncaric created SPARK-26192:
---

 Summary: MesosClusterScheduler reads options from dispatcher conf 
instead of submission conf
 Key: SPARK-26192
 URL: https://issues.apache.org/jira/browse/SPARK-26192
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Martin Loncaric


There are at least two options accessed in MesosClusterScheduler that should 
come from the submission's configuration instead of the dispatcher's:

spark.app.name
spark.mesos.fetchCache.enable

This means that all Mesos tasks for Spark drivers have uninformative names of 
the form "Driver for (MainClass)" rather than the configured application name, 
and Spark drivers never cache files. Coincidentally, the 
spark.mesos.fetchCache.enable option is misnamed, as referenced in the linked 
JIRA.
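A minimal sketch of the intended lookup order follows. The class and map 
parameters are illustrative only; the real MesosClusterScheduler is written in 
Scala and uses its own types:
{code:java}
import java.util.Map;

public class SubmissionConfResolution {
  // Resolve an option from the submitted driver's own properties first,
  // then fall back to the dispatcher's configuration, then to a default.
  static String resolve(Map<String, String> submissionProps,
                        Map<String, String> dispatcherConf,
                        String key,
                        String defaultValue) {
    if (submissionProps.containsKey(key)) {
      return submissionProps.get(key);
    }
    return dispatcherConf.getOrDefault(key, defaultValue);
  }
}
{code}
With such a fallback, resolving "spark.app.name" from the submission first would 
give each driver task the configured application name instead of the generic 
"Driver for (MainClass)".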



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701136#comment-16701136
 ] 

Gyanendra Dwivedi commented on SPARK-26190:
---

[~vanzin] I am sorry, I am not able to explain to you the "real" enterprise-level 
limitations for developers. I cannot expose more about why a custom script is the 
only option!

Can you give me one good reason why you must have the script name hard-coded in 
SparkLauncher? Why should SparkLauncher expect the script name "spark-submit" for 
a non-Windows OS only in the path SPARK_HOME/bin?

I am not willing to invest my time justifying this any more; feel free to close 
it or whatever. This was a Spark improvement request arising from a developer's 
real-world challenge.

Anyway, by the time it comes back to me (if someone fixes it), it will be too 
late for my bus.

I will just patch it and move on, as I do with most poorly written APIs.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> The improvement request is for improvement in the SparkLauncher class which 
> is responsible to execute builtin spark-submit script using Java API.
> In my use case, there is a custom wrapper script which help in integrating 
> the security features while submitting the spark job using builtin 
> spark-submit.
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26191) Control number of truncated fields

2018-11-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26191:
--

 Summary: Control number of truncated fields
 Key: SPARK-26191
 URL: https://issues.apache.org/jira/browse/SPARK-26191
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the threshold at which fields are truncated when converted to string 
can only be controlled via a global SQL config. We need to add a maxFields 
parameter to all functions/methods that could potentially produce a truncated 
string from a sequence of fields.

One of the use cases is toFile. This method aims to output non-truncated plans; 
for now users have to set the global config to flush whole plans.
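As an illustration of the requested parameter, here is a hand-written sketch of 
a truncating join that takes maxFields per call; it is not Spark's actual 
implementation, which lives in the Scala utilities:
{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class TruncatedStringExample {
  // Join at most maxFields elements; append a marker when fields were dropped.
  // maxFields <= 0 is treated as "no limit", mirroring the idea of letting a
  // call site such as toFile override the global threshold.
  static String truncatedString(List<String> fields, String sep, int maxFields) {
    if (maxFields <= 0 || fields.size() <= maxFields) {
      return String.join(sep, fields);
    }
    String shown = fields.stream().limit(maxFields).collect(Collectors.joining(sep));
    return shown + sep + "... " + (fields.size() - maxFields) + " more fields";
  }
}
{code}
A caller such as toFile could then pass a "no limit" value to emit whole plans 
without changing the global SQL config.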



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26189:
--
Priority: Minor  (was: Major)

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700937#comment-16700937
 ] 

Marcelo Vanzin commented on SPARK-26190:


If you need to run things before spark-submit runs, consider writing your own 
spark-env.sh that does what you need. That's a supported feature of Spark.

Sorry, but I still don't think it's a good idea to provide the functionality 
you're asking for.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> The improvement request is for improvement in the SparkLauncher class which 
> is responsible to execute builtin spark-submit script using Java API.
> In my use case, there is a custom wrapper script which help in integrating 
> the security features while submitting the spark job using builtin 
> spark-submit.
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyanendra Dwivedi updated SPARK-26190:
--
Description: 
The improvement request is for the SparkLauncher class, which is responsible for 
executing the built-in spark-submit script via the Java API.

In my use case, there is a custom wrapper script which helps integrate the 
security features while submitting the Spark job using the built-in spark-submit.

Currently the script name is hard-coded in the 'createBuilder()' method of the 
org.apache.spark.launcher.SparkLauncher class:
{code:java}
// code placeholder

private ProcessBuilder createBuilder() {
List cmd = new ArrayList();
String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
"spark-submit";
cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]{this.builder.getSparkHome(), "bin", script}));
cmd.addAll(this.builder.buildSparkSubmitArgs());
..
..
}{code}
 

 

It has the following issues, which prevent its usage in certain scenarios:

1) Developers may not use their own custom script with a different name. They are 
forced to use the one shipped with the installation. Overwriting it may not be an 
option when altering the original installation is not allowed.

2) The code expects the script to be present in the "SPARK_HOME/bin" folder.

3) The 'createBuilder()' method is private and hence extending 
'org.apache.spark.launcher.SparkLauncher' is not an option.

Proposed solution:

1) Developers should be given an optional parameter to set their own custom 
script, which may be located at any path.

2) Only when the parameter is not set should the default spark-submit script be 
taken from the SPARK_HOME/bin folder.

  was:
Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:
{code:java}
// code placeholder

private ProcessBuilder createBuilder() {
List cmd = new ArrayList();
String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
"spark-submit";
cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]{this.builder.getSparkHome(), "bin", script}));
cmd.addAll(this.builder.buildSparkSubmitArgs());
..
..
}{code}
 

 

It has following issues, which prevents its usage in certain scenario. 

1) Developer may not use their own custom scripts with different name. They are 
forced to use the one shipped with the installation. Overwriting that may not 
be the option, when it is not allowed to alter the original installation.

2) The code expect the script to be present at "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private and hence, extending the 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developer should be given an optional parameter to set their own custom 
script, which may be located at any path.

2) Only in case the parameter is not set, the default spark-submit script 
should be taken from SPARK_HOME/bin folder.


> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> The improvement request is for improvement in the SparkLauncher class which 
> is responsible to execute builtin spark-submit script using Java API.
> In my use case, there is a custom wrapper script which help in integrating 
> the security features while submitting the spark job using builtin 
> spark-submit.
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilde

[jira] [Comment Edited] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700909#comment-16700909
 ] 

Gyanendra Dwivedi edited comment on SPARK-26190 at 11/27/18 7:51 PM:
-

[~vanzin] If it was so easy to execute any custom script using 
{{Runtime.getRuntime().exec()}}, then why does SparkLauncher exist to execute the 
built-in spark-submit script?

Creating fake symlinks etc. is not a viable solution for production servers, 
where the installation location may simply change with a new version. An ad-hoc 
solution like creating a symlink should not be a reason for closing this feature 
request.

I don't know why it was never thought necessary to keep these things 
configurable. Hard coding, assuming a specific environment/use-case setup and 
forcing developers to look for workarounds should not be encouraged.


was (Author: gm_dwivedi):
[~vanzin] If it was so easy to run or any custom script

using {{Runtime.getRuntime().exec(); then why does sparkLauncher exist to 
execute builtin spark-submit script?}}

Creating fake symlinks etc is not a viable solution for production servers 
where installation location just may change with new version etc. Creating a 
symlink like adhoc solution should not be a reason for closing this feature 
request.

Don't know why it was never thought to keep things configurable. Hard coding, 
assuming a specific environment/use case setup and forcing developers to look 
for work around should not be encouraged.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700909#comment-16700909
 ] 

Gyanendra Dwivedi edited comment on SPARK-26190 at 11/27/18 7:44 PM:
-

[~vanzin] If it was so easy to run any custom script using 
{{Runtime.getRuntime().exec()}}, then why does SparkLauncher exist to execute the 
built-in spark-submit script?

Creating fake symlinks etc. is not a viable solution for production servers, 
where the installation location may simply change with a new version. An ad-hoc 
solution like creating a symlink should not be a reason for closing this feature 
request.

I don't know why it was never thought necessary to keep these things 
configurable. Hard coding, assuming a specific environment/use-case setup and 
forcing developers to look for workarounds should not be encouraged.


was (Author: gm_dwivedi):
[~vanzin] If it was so easy to run spark-submit script

using {{Runtime.getRuntime().exec(); then why does sparkLauncher exist?}}

Creating fake symlinks etc is not a viable solution for production servers 
where installation location just may change with new version etc. Creating a 
symlink like adhoc solution should not be a reason for closing this feature 
request.

Don't know why it was never thought to keep things configurable. Hard coding, 
assuming a specific environment/use case setup and forcing developers to look 
for work around should not be encouraged.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyanendra Dwivedi updated SPARK-26190:
--
Description: 
Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:
{code:java}
// code placeholder

private ProcessBuilder createBuilder() {
List cmd = new ArrayList();
String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
"spark-submit";
cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]{this.builder.getSparkHome(), "bin", script}));
cmd.addAll(this.builder.buildSparkSubmitArgs());
..
..
}{code}
 

 

It has following issues, which prevents its usage in certain scenario. 

1) Developer may not use their own custom scripts with different name. They are 
forced to use the one shipped with the installation. Overwriting that may not 
be the option, when it is not allowed to alter the original installation.

2) The code expect the script to be present at "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private and hence, extending the 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developer should be given an optional parameter to set their own custom 
script, which may be located at any path.

2) Only in case the parameter is not set, the default spark-submit script 
should be taken from SPARK_HOME/bin folder.

  was:
Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:
{code:java}
// code placeholder

private ProcessBuilder createBuilder() {
List cmd = new ArrayList();
String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
"spark-submit";
cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]{this.builder.getSparkHome(), "bin", script}));
cmd.addAll(this.builder.buildSparkSubmitArgs());
..
..
}{code}
 

 

It has following issues, which prevents its usage in certain scenario. 

1) Developer may not use their own custom scripts with different name. They are 
forced to use the one shipped with the installation. Overwriting that may not 
be the option, when it is not allowed to alter the original installation.

2) The code expect the script to be present at "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private and hence, extending the 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developer should be given option to set their own custom script, which may 
be located at any path.

2) Only in case the parameter is not set, the default should be taken from 
SPARK_HOME/bin folder.


> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given an optional parameter to set their own custom 
> script, which may be located at any path.
> 2) Only in case the parameter is not set, the default spark-submit script 
> should be taken from SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700909#comment-16700909
 ] 

Gyanendra Dwivedi commented on SPARK-26190:
---

[~vanzin] If it was so easy to run the spark-submit script using 
{{Runtime.getRuntime().exec()}}, then why does SparkLauncher exist?

Creating fake symlinks etc. is not a viable solution for production servers, 
where the installation location may simply change with a new version. An ad-hoc 
solution like creating a symlink should not be a reason for closing this feature 
request.

I don't know why it was never thought necessary to keep these things 
configurable. Hard coding, assuming a specific environment/use-case setup and 
forcing developers to look for workarounds should not be encouraged.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700892#comment-16700892
 ] 

Gyanendra Dwivedi commented on SPARK-26190:
---

[~vanzin] I have a custom wrapper script on top of the spark-submit script, which 
performs certain security checks before it calls the built-in spark-submit script 
to submit the job. I have to use this script to launch the Spark job from a Java 
program. Do I have any other option? Kindly help.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700896#comment-16700896
 ] 

Marcelo Vanzin commented on SPARK-26190:


Just run your script without SparkLauncher? e.g. using 
{{Runtime.getRuntime().exec()}}.

Or create a fake SPARK_HOME that is mostly symlinks to the original SPARK_HOME, 
and has your custom spark-submit.

I don't think this is a good thing to have in Spark. It's a pretty obscure use 
case and very easy for people to do the wrong thing and blame Spark.
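For completeness, a minimal sketch of the first suggestion, invoking a custom 
wrapper script directly from Java without SparkLauncher; the script path and 
arguments are placeholders:
{code:java}
import java.io.IOException;

public class CustomSubmitterExample {
  public static void main(String[] args) throws IOException, InterruptedException {
    // The script path, class, master and jar below are placeholders.
    ProcessBuilder pb = new ProcessBuilder(
        "/opt/company/bin/secure-spark-submit",
        "--class", "com.example.Main",
        "--master", "yarn",
        "/path/to/app.jar");
    pb.inheritIO();                       // stream the wrapper's output to this JVM's stdout/stderr
    int exitCode = pb.start().waitFor();  // block until the wrapper script finishes
    System.out.println("submitter exited with " + exitCode);
  }
}
{code}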

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700892#comment-16700892
 ] 

Gyanendra Dwivedi edited comment on SPARK-26190 at 11/27/18 7:23 PM:
-

[~vanzin] I have a custom wrapper script on top of the spark-submit script, which 
performs certain security checks before it calls the built-in spark-submit script 
to submit the job. I have to use this script to launch the Spark job from a Java 
program. Do I have any other option? Kindly help.

EDIT: The custom script is not located under SPARK_HOME/bin.


was (Author: gm_dwivedi):
[~vanzin] I have a wrapper custom script on top of spark-submit script,  which 
has certain security check before it calls the built-in spark-submit script to 
submit the job. I have to use this script to launch the spark job using a Java 
program. Do I have any other option, kindly help.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyanendra Dwivedi updated SPARK-26190:
--
Description: 
Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:
{code:java}
// code placeholder

private ProcessBuilder createBuilder() {
List cmd = new ArrayList();
String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
"spark-submit";
cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]{this.builder.getSparkHome(), "bin", script}));
cmd.addAll(this.builder.buildSparkSubmitArgs());
..
..
}{code}
 

 

It has following issues, which prevents its usage in certain scenario. 

1) Developer may not use their own custom scripts with different name. They are 
forced to use the one shipped with the installation. Overwriting that may not 
be the option, when it is not allowed to alter the original installation.

2) The code expect the script to be present at "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private and hence, extending the 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developer should be given option to set their own custom script, which may 
be located at any path.

2) Only in case the parameter is not set, the default should be taken from 
SPARK_HOME/bin folder.

  was:
Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:

 

private ProcessBuilder createBuilder() {
 List cmd = new ArrayList();
 String script = CommandBuilderUtils.isWindows() ? "*spark-submit.cmd*" : 
"*spark-submit*";
 cmd.add(CommandBuilderUtils.join(File.separator, new 
String[]\{this.builder.getSparkHome(), "bin", script}));
 cmd.addAll(this.builder.buildSparkSubmitArgs());



.

}

 

It has following issues, which prevents its usage in certain scenario. 

1) Developer may not use their own custom scripts with different name. They are 
forced to use the one shipped with the installation. Overwriting that may not 
be the option, when it is not allowed to alter the original installation.

2) The code expect the script to be present at "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private and hence, extending the 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developer should be given option to set their own custom script, which may 
be located at any path.

2) Only in case the parameter is not set, the default should be taken from 
SPARK_HOME/bin folder.


> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
> {code:java}
> // code placeholder
> private ProcessBuilder createBuilder() {
> List cmd = new ArrayList();
> String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : 
> "spark-submit";
> cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]{this.builder.getSparkHome(), "bin", script}));
> cmd.addAll(this.builder.buildSparkSubmitArgs());
> ..
> ..
> }{code}
>  
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700890#comment-16700890
 ] 

Marcelo Vanzin commented on SPARK-26190:


Based solely on what you wrote here, I'm leaning towards closing this.

SparkLauncher is a programmatic API around spark-submit, not around your custom 
script. If you have a custom script, you can call it without using 
SparkLauncher.

> SparkLauncher: Allow users to set their own submitter script instead of 
> hardcoded spark-submit
> --
>
> Key: SPARK-26190
> URL: https://issues.apache.org/jira/browse/SPARK-26190
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
>Reporter: Gyanendra Dwivedi
>Priority: Major
>
> Currently the script name is hard-coded in the 'createBuilder()' method of 
> org.apache.spark.launcher.SparkLauncher class:
>  
> private ProcessBuilder createBuilder() {
>  List cmd = new ArrayList();
>  String script = CommandBuilderUtils.isWindows() ? "*spark-submit.cmd*" : 
> "*spark-submit*";
>  cmd.add(CommandBuilderUtils.join(File.separator, new 
> String[]\{this.builder.getSparkHome(), "bin", script}));
>  cmd.addAll(this.builder.buildSparkSubmitArgs());
> 
> .
> }
>  
> It has following issues, which prevents its usage in certain scenario. 
> 1) Developer may not use their own custom scripts with different name. They 
> are forced to use the one shipped with the installation. Overwriting that may 
> not be the option, when it is not allowed to alter the original installation.
> 2) The code expect the script to be present at "SPARK_HOME/bin" folder. 
> 3) The 'createBuilder()' method is private and hence, extending the 
> 'org.apache.spark.launcher.SparkLauncher' is not an option.
>  
> Proposed solution:
> 1) Developer should be given option to set their own custom script, which may 
> be located at any path.
> 2) Only in case the parameter is not set, the default should be taken from 
> SPARK_HOME/bin folder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses Spark to partition and output parquet files to Amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in Spark 2.3.2 and prior, partition values are read as 
strings by default. However, in Spark 2.4.0 and later, the type of each partition 
value is inferred by default, so partitions such as 00 become 0 and 4d becomes 
4.0.
 Here is a log sample of this behavior from one of our jobs:
 2.4.0:
{code:java}
18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
range: 0-662, partition values: [0]
18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
range: 0-662, partition values: [ef]
18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
range: 0-662, partition values: [4a]
18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
range: 0-662, partition values: [74]
18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
range: 0-662, partition values: [f5]
18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
range: 0-662, partition values: [50]
18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
range: 0-662, partition values: [70]
18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
range: 0-662, partition values: [b9]
18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
range: 0-662, partition values: [d2]
18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
range: 0-662, partition values: [51]
18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
range: 0-662, partition values: [84]
18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
range: 0-662, partition values: [b5]
18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
range: 0-662, partition values: [88]
18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
range: 0-662, partition values: [4.0]
18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
range: 0-662, partition values: [ac]
18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
range: 0-662, partition values: [24]
18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
range: 0-662, partition values: [fd]
18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
range: 0-662, partition values: [52]
18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
range: 0-662, partition values: [ab]
18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
range: 0-662, partition values: [f8]
18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
range: 0-662, partition values: [7a]
18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
range: 0-662, partition values: [ba]
18/11/27 14:02:51 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, 
range: 0-662, partition values: [2.0]
18/11/27 14:02:52 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, 
range: 0-662, partition values: [3]
18/11/27 14:02:53 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, 
range: 0-662, partition values: [57]
18/11/27 14:02:54 INFO FileSc

[jira] [Created] (SPARK-26190) SparkLauncher: Allow users to set their own submitter script instead of hardcoded spark-submit

2018-11-27 Thread Gyanendra Dwivedi (JIRA)
Gyanendra Dwivedi created SPARK-26190:
-

 Summary: SparkLauncher: Allow users to set their own submitter 
script instead of hardcoded spark-submit
 Key: SPARK-26190
 URL: https://issues.apache.org/jira/browse/SPARK-26190
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core, Spark Submit
Affects Versions: 2.1.0
 Environment: Apache Spark 2.0.1 on yarn cluster (MapR distribution)
Reporter: Gyanendra Dwivedi


Currently the script name is hard-coded in the 'createBuilder()' method of 
org.apache.spark.launcher.SparkLauncher class:

 

{code:java}
private ProcessBuilder createBuilder() {
  List<String> cmd = new ArrayList<>();
  String script = CommandBuilderUtils.isWindows() ? "spark-submit.cmd" : "spark-submit";
  cmd.add(CommandBuilderUtils.join(File.separator,
      new String[]{this.builder.getSparkHome(), "bin", script}));
  cmd.addAll(this.builder.buildSparkSubmitArgs());
  // ...
}
{code}

 

It has the following issues, which prevent its usage in certain scenarios.

1) Developers may not use their own custom scripts with a different name; they are 
forced to use the one shipped with the installation. Overwriting it may not be 
an option when altering the original installation is not allowed.

2) The code expects the script to be present in the "SPARK_HOME/bin" folder. 

3) The 'createBuilder()' method is private, and hence extending 
'org.apache.spark.launcher.SparkLauncher' is not an option.

 

Proposed solution:

1) Developers should be given the option to set their own custom script, which 
may be located at any path.

2) Only if this parameter is not set should the default be taken from the 
SPARK_HOME/bin folder. (An illustrative workaround sketch follows below.)
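
Until such an option exists, callers who need a non-default submitter have to 
bypass SparkLauncher and launch the child process themselves. Below is a minimal 
sketch of that workaround; the script path, master/deploy-mode arguments, main 
class, and jar path are all hypothetical placeholders, not part of any Spark API.

{code:java}
import scala.sys.process._

// Hypothetical custom submitter located outside SPARK_HOME/bin (placeholder path).
val customSubmit = "/opt/tooling/my-spark-submit.sh"
val appJar = "/path/to/app.jar"   // placeholder application jar

// Build the submit command by hand, mirroring what SparkLauncher's createBuilder()
// would do, but pointing at the custom script instead of SPARK_HOME/bin/spark-submit.
val cmd = Seq(
  customSubmit,
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "--class", "com.example.Main",  // hypothetical main class
  appJar)

// Run the submitter and block until it exits.
val exitCode = cmd.!
{code}

This loses SparkLauncher's process handle and state listeners, which is exactly 
why a setter on the launcher (as proposed above) would be preferable.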



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Summary: Spark 2.4.0 Partitioning behavior breaks backwards compatibility  
(was: Spark 2.4.0 behavior breaks backwards compatibility)

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
> Here is a log sample of this behavior from one of our jobs:
> 2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02

[jira] [Updated] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.
Here is a log sample of this behavior from one of our jobs:
2.4.0:
{code:java}
18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
range: 0-662, partition values: [0]
18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
range: 0-662, partition values: [ef]
18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
range: 0-662, partition values: [4a]
18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
range: 0-662, partition values: [74]
18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
range: 0-662, partition values: [f5]
18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
range: 0-662, partition values: [50]
18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
range: 0-662, partition values: [70]
18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
range: 0-662, partition values: [b9]
18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
range: 0-662, partition values: [d2]
18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
range: 0-662, partition values: [51]
18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
range: 0-662, partition values: [84]
18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
range: 0-662, partition values: [b5]
18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
range: 0-662, partition values: [88]
18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
range: 0-662, partition values: [4.0]
18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
range: 0-662, partition values: [ac]
18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
range: 0-662, partition values: [24]
18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
range: 0-662, partition values: [fd]
18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
range: 0-662, partition values: [52]
18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
range: 0-662, partition values: [ab]
18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
range: 0-662, partition values: [f8]
18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
range: 0-662, partition values: [7a]
18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
range: 0-662, partition values: [ba]
18/11/27 14:02:51 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, 
range: 0-662, partition values: [2.0]
18/11/27 14:02:52 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, 
range: 0-662, partition values: [3]
18/11/27 14:02:53 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, 
range: 0-662, partition values: [57]
18/11/27 14:02:54 INFO FileScan

[jira] [Updated] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
 
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
  

In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default:
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:

[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]

 
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
And this conf's default value is true
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
.doc("When true, automatically infer the data types for partitioned columns.")
.booleanConf
.createWithDefault(true){code}
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
  

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.
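
As a stop-gap, the conf quoted above can be turned off per session so that 
partition values such as "00" and "4d" stay strings (they most likely become 0 
and 4.0 because the inference path tries integer and then double parsing, and 
Java's Double.parseDouble accepts a trailing 'd'). A minimal sketch, with the 
session setup and bucket path as illustrative placeholders:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-hex-partitions")  // placeholder app name
  .getOrCreate()

// Restore the 2.3.x behavior: keep partition column values as plain strings.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// Placeholder prefix; the partition directories look like suffix=00 ... suffix=ff.
val df = spark.read.parquet("s3a://my-bucket/ddgirard/")

// With inference disabled, the suffix column should come back as StringType.
df.printSchema()
{code}

The same flag can also be passed at submit time with 
--conf spark.sql.sources.partitionColumnTypeInference.enabled=false.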

 
  

  was:
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
 
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
  

In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:

https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133

 
  
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
And this conf's default value is true
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
.doc("When true, automatically infer the data types for partitioned columns.")
.booleanConf
.createWithDefault(true){code}
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
  

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
  


> Spark 2.4.0 behavior breaks backwards compatibility
> ---
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
> After some investigation, we've isolated the issue to
>  
> [https://g

[jira] [Commented] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700880#comment-16700880
 ] 

Xiao Li commented on SPARK-26189:
-

cc [~huaxingao]

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26189:

Description: We should fix the doc of unionAll in SparkR. See the 
discussion: https://github.com/apache/spark/pull/23131/files#r236760822

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26189:
---

 Summary: Fix the doc of unionAll in SparkR
 Key: SPARK-26189
 URL: https://issues.apache.org/jira/browse/SPARK-26189
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 3.0.0
Reporter: Xiao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
 
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
  

In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}

However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:
 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
 
 
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
And this conf's default value is true
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
.doc("When true, automatically infer the data types for partitioned columns.")
.booleanConf
.createWithDefault(true){code}

 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
  

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
  

  was:
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
 

 In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):

```
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId)
```
However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:
 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
 
```
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId)
```

And this conf's default value is true

```
val PARTITION_COLUMN_TYPE_INFERENCE =
  buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
    .doc("When true, automatically infer the data types for partitioned columns.")
    .booleanConf
    .createWithDefault(true)
```
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
 

 

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
 


> Spark 2.4.0 behavior breaks backwards compatibility
> ---
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>   

[jira] [Updated] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
 
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
  

In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:

https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133

 
  
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
And this conf's default value is true
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
.doc("When true, automatically infer the data types for partitioned columns.")
.booleanConf
.createWithDefault(true){code}
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
  

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
  

  was:
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
 
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
  

In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}

However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:
 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
 
 
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
And this conf's default value is true
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
.doc("When true, automatically infer the data types for partitioned columns.")
.booleanConf
.createWithDefault(true){code}

 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
  

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
  


> Spark 2.4.0 behavior breaks backwards compatibility
> ---
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
> After some investigation, we've isolated the iss

[jira] [Updated] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-
Description: 
My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

We've observed that in spark 2.3.2 and prior, it reads the partitions as 
strings by default. However, in spark 2.4.0 and later, the type of each 
partition is inferred by default, and partitions such as 00 become 0 and 4d 
become 4.0.

After some investigation, we've isolated the issue to
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
 

 In the inferPartitioning method, 2.3.2 sets the type inference to false by 
default (lines 132-136):

```
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId)
```
However, in version 2.4.0, the typeInference flag has been replaced with a 
config flag:
 
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
 
```
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId)
```

And this conf's default value is true

```
val PARTITION_COLUMN_TYPE_INFERENCE =
  buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
    .doc("When true, automatically infer the data types for partitioned columns.")
    .booleanConf
    .createWithDefault(true)
```
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
 

 

I was wondering if a bug report would be appropriate to preserve backwards 
compatibility and change the default conf value to false.

 
 

> Spark 2.4.0 behavior breaks backwards compatibility
> ---
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
> After some investigation, we've isolated the issue to
> [https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
>  
>  In the inferPartitioning method, 2.3.2 sets the type inference to false by 
> default (lines 132-136):
> ```
> val spec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = false,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId)
> ```
> However, in version 2.4.0, the typeInference flag has been replaced with a 
> config flag:
>  
> [https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
>  
> ```
> val inferredPartitionSpec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId)
> ```
> And this conf's default value is true
> ```
> val PARTITION_COLUMN_TYPE_INFERENCE =
>   buildConf("spark.sql.so

[jira] [Created] (SPARK-26188) Spark 2.4.0 behavior breaks backwards compatibility

2018-11-27 Thread Damien Doucet-Girard (JIRA)
Damien Doucet-Girard created SPARK-26188:


 Summary: Spark 2.4.0 behavior breaks backwards compatibility
 Key: SPARK-26188
 URL: https://issues.apache.org/jira/browse/SPARK-26188
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Damien Doucet-Girard






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2018-11-27 Thread Pavel Chernikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Chernikov updated SPARK-26187:

Description: 
This is basically the same issue as SPARK-26154, but with a more easily 
reproducible and concrete example:
{code:java}
val rateStream = session.readStream
 .format("rate")
 .option("rowsPerSecond", 1)
 .option("numPartitions", 1)
 .load()

import org.apache.spark.sql.functions._

val fooStream = rateStream
 .select(col("value").as("fooId"), col("timestamp").as("fooTime"))

val barStream = rateStream
 // Introduce misses for ease of debugging
 .where(col("value") % 2 === 0)
 .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
If barStream is configured to happen earlier than fooStream based on the time 
range condition, then everything is all right; no previously matched records 
are flushed with outer NULLs:
{code:java}
val query = fooStream
 .withWatermark("fooTime", "5 seconds")
 .join(
   barStream.withWatermark("barTime", "5 seconds"),
   expr("""
 barId = fooId AND
 fooTime >= barTime AND
 fooTime <= barTime + interval 5 seconds
"""),
   joinType = "leftOuter"
 )
 .writeStream
 .format("console")
 .option("truncate", false)
 .start(){code}
It's easy to observe that only odd rows are flushed with NULLs on the right:
{code:java}
[info] Batch: 1 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
[info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
[info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
[info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
[info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
[info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
[info] +-+---+-+---+ 
[info] Batch: 2 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |1    |2018-11-27 13:12:35.976|null |null                   | 
[info] |3    |2018-11-27 13:12:37.976|null |null                   | 
[info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
[info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
[info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
[info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
[info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
[info] +-+---+-+---+ 
[info] Batch: 3 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
[info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
[info] |7    |2018-11-27 13:12:41.976|null |null                   | 
[info] |9    |2018-11-27 13:12:43.976|null |null                   | 
[info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
[info] |5    |2018-11-27 13:12:39.976|null |null                   | 
[info] |11   |2018-11-27 13:12:45.976|null |null                   | 
[info] |13   |2018-11-27 13:12:47.976|null |null                   | 
[info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
[info] +-+---+-+---+
{code}
On the other hand, if we switch the ordering so that fooStream happens 
earlier based on the time range condition:
{code:java}
val query = fooStream
 .withWatermark("fooTime", "5 seconds")
 .join(
   barStream.withWatermark("barTime", "5 seconds"),
   expr("""
 barId = fooId AND
 barTime >= fooTime AND
 barTime <= fooTime + interval 5 seconds
"""),
   joinType = "leftOuter"
 )
 .writeStream
 .format("console")
 .option("truncate", false)
 .start(){code}
Some, but not all, previously matched records (with even IDs) are emitted with 
outer NULLs, along with all unmatched records (with odd IDs): 
{code:java}
[info] Batch: 1 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+
[info] |0    |2018-11-27 13:26:11.463|0    |2018-11-27 13:26:11.463| 
[info] |6    |2018-11-27 13:26:17.463|6    |2018-11-27 13:26:17.463| 
[info] |10   |2018-11-27 13:26:21.463|10   |2018-11-27 13:26:21.463| 
[info] |8    |2018-11-27 13:26:19.463|8    |2018-11-27 13:26:19.463| 
[info] |2    |2

[jira] [Created] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2018-11-27 Thread Pavel Chernikov (JIRA)
Pavel Chernikov created SPARK-26187:
---

 Summary: Stream-stream left outer join returns outer nulls for 
already matched rows
 Key: SPARK-26187
 URL: https://issues.apache.org/jira/browse/SPARK-26187
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.2
Reporter: Pavel Chernikov


This is basically the same issue as 
[SPARK-26154|https://issues.apache.org/jira/browse/SPARK-26154], but with a 
more easily reproducible and concrete example:



 
{code:java}
val rateStream = session.readStream
 .format("rate")
 .option("rowsPerSecond", 1)
 .option("numPartitions", 1)
 .load()

import org.apache.spark.sql.functions._

val fooStream = rateStream
 .select(col("value").as("fooId"), col("timestamp").as("fooTime"))

val barStream = rateStream
 // Introduce misses for ease of debugging
 .where(col("value") % 2 === 0)
 .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
If barStream is configured to happen earlier than fooStream based on the time 
range condition, then everything is all right; no previously matched records 
are flushed with outer NULLs:

 
{code:java}
val query = fooStream
 .withWatermark("fooTime", "5 seconds")
 .join(
   barStream.withWatermark("barTime", "5 seconds"),
   expr("""
 barId = fooId AND
 fooTime >= barTime AND
 fooTime <= barTime + interval 5 seconds
"""),
   joinType = "leftOuter"
 )
 .writeStream
 .format("console")
 .option("truncate", false)
 .start(){code}
It's easy to observe that only odd rows are flushed with NULLs on the right:
{code:java}
[info] Batch: 1 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
[info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
[info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
[info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
[info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
[info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
[info] +-+---+-+---+ 
[info] Batch: 2 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |1    |2018-11-27 13:12:35.976|null |null                   | 
[info] |3    |2018-11-27 13:12:37.976|null |null                   | 
[info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
[info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
[info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
[info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
[info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
[info] +-+---+-+---+ 
[info] Batch: 3 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+---+ 
[info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
[info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
[info] |7    |2018-11-27 13:12:41.976|null |null                   | 
[info] |9    |2018-11-27 13:12:43.976|null |null                   | 
[info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
[info] |5    |2018-11-27 13:12:39.976|null |null                   | 
[info] |11   |2018-11-27 13:12:45.976|null |null                   | 
[info] |13   |2018-11-27 13:12:47.976|null |null                   | 
[info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
[info] +-+---+-+---+
{code}
On the other hand, if we switch the ordering so that fooStream happens 
earlier based on the time range condition:

 
{code:java}
val query = fooStream
 .withWatermark("fooTime", "5 seconds")
 .join(
   barStream.withWatermark("barTime", "5 seconds"),
   expr("""
 barId = fooId AND
 barTime >= fooTime AND
 barTime <= fooTime + interval 5 seconds
"""),
   joinType = "leftOuter"
 )
 .writeStream
 .format("console")
 .option("truncate", false)
 .start(){code}
Some, but not all, previously matched records (with even IDs) are emitted with 
outer NULLs, along with all unmatched records (with odd IDs):

 
{code:java}
[info] Batch: 1 
[info] +-+---+-+---+ 
[info] |fooId|fooTime                |barId|barTime                | 
[info] +-+---+-+-

[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Adrian Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700836#comment-16700836
 ] 

Adrian Wang commented on SPARK-26155:
-

[~Jk_Self] can you also test this on Spark 2.4?

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486 & 487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and figured out that the root cause is the community patch 
> SPARK-21052, which adds metrics to the hash join process. The impacted code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
> . Q19 takes about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-11-27 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700829#comment-16700829
 ] 

Huaxin Gao commented on SPARK-21291:


[~felixcheung] Is it OK with you if I modify the title for this Jira and open a 
new one for bucketBy?

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700813#comment-16700813
 ] 

Apache Spark commented on SPARK-26186:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23158

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Priority: Major
>
> In-progress applications whose last updated time is within the cleaning 
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the 
> maxTime") {
> val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
> val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
> val maxAge = TimeUnit.DAYS.toMillis(7)
> val clock = new ManualClock(0)
> val provider = new FsHistoryProvider(
>   createTestConf().set("spark.history.fs.cleaner.maxAge", 
> s"${maxAge}ms"), clock)
> val log = newLogFile("inProgressApp1", None, inProgress = true)
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1"))
> )
> clock.setTime(firstFileModifiedTime)
> provider.checkForLogs()
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null)
> )
> clock.setTime(secondFileModifiedTime)
> provider.checkForLogs()
> clock.setTime(TimeUnit.DAYS.toMillis(10))
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null),
>   SparkListenerJobEnd(0, 1L, JobSucceeded)
> )
> provider.checkForLogs()
> // This should not trigger any cleanup
> updateAndCheck(provider) { list =>
>   list.size should be(1)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700817#comment-16700817
 ] 

Apache Spark commented on SPARK-26184:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23158

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For an in-progress application, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700815#comment-16700815
 ] 

Apache Spark commented on SPARK-26184:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23158

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For an in-progress application, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700816#comment-16700816
 ] 

Apache Spark commented on SPARK-26184:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23158

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For an in-progress application, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26184:


Assignee: Apache Spark

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For an in-progress application, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26184:


Assignee: (was: Apache Spark)

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For an in-progress application, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26186:


Assignee: (was: Apache Spark)

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Priority: Major
>
> In-progress applications whose last updated time is within the cleaning
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the maxTime") {
>   val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
>   val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
>   val maxAge = TimeUnit.DAYS.toMillis(7)
>   val clock = new ManualClock(0)
>   val provider = new FsHistoryProvider(
>     createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
>   val log = newLogFile("inProgressApp1", None, inProgress = true)
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
>   )
>   clock.setTime(firstFileModifiedTime)
>   provider.checkForLogs()
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null)
>   )
>   clock.setTime(secondFileModifiedTime)
>   provider.checkForLogs()
>   clock.setTime(TimeUnit.DAYS.toMillis(10))
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null),
>     SparkListenerJobEnd(0, 1L, JobSucceeded)
>   )
>   provider.checkForLogs()
>   // This should not trigger any cleanup
>   updateAndCheck(provider) { list =>
>     list.size should be(1)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26186:


Assignee: Apache Spark

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Major
>
> In-progress applications whose last updated time is within the cleaning
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the maxTime") {
>   val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
>   val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
>   val maxAge = TimeUnit.DAYS.toMillis(7)
>   val clock = new ManualClock(0)
>   val provider = new FsHistoryProvider(
>     createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
>   val log = newLogFile("inProgressApp1", None, inProgress = true)
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
>   )
>   clock.setTime(firstFileModifiedTime)
>   provider.checkForLogs()
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null)
>   )
>   clock.setTime(secondFileModifiedTime)
>   provider.checkForLogs()
>   clock.setTime(TimeUnit.DAYS.toMillis(10))
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null),
>     SparkListenerJobEnd(0, 1L, JobSucceeded)
>   )
>   provider.checkForLogs()
>   // This should not trigger any cleanup
>   updateAndCheck(provider) { list =>
>     list.size should be(1)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700814#comment-16700814
 ] 

Apache Spark commented on SPARK-26186:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23158

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Priority: Major
>
> In-progress applications whose last updated time is within the cleaning
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the maxTime") {
>   val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
>   val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
>   val maxAge = TimeUnit.DAYS.toMillis(7)
>   val clock = new ManualClock(0)
>   val provider = new FsHistoryProvider(
>     createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
>   val log = newLogFile("inProgressApp1", None, inProgress = true)
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
>   )
>   clock.setTime(firstFileModifiedTime)
>   provider.checkForLogs()
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null)
>   )
>   clock.setTime(secondFileModifiedTime)
>   provider.checkForLogs()
>   clock.setTime(TimeUnit.DAYS.toMillis(10))
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null),
>     SparkListenerJobEnd(0, 1L, JobSucceeded)
>   )
>   provider.checkForLogs()
>   // This should not trigger any cleanup
>   updateAndCheck(provider) { list =>
>     list.size should be(1)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-26185:
---
Description: https://issues.apache.org/jira/browse/SPARK-24101 added 
weightCol in MulticlassClassificationEvaluator.scala. This Jira will add 
weightCol in python version of MulticlassClassificationEvaluator.  (was: 
--https://issues.apache.org/jira/browse/SPARK-24101-- added weightCol in 
MulticlassClassificationEvaluator.scala. This Jira will add weightCol in python 
version of MulticlassClassificationEvaluator.)

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.
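
For context, the Scala-side evaluator already takes the weight column after
SPARK-24101; the Python change tracked here is expected to mirror it. The
snippet below is a small illustrative sketch of the Scala usage, not code from
this ticket: the column names are made up and the setWeightCol setter name is
assumed from the usual HasWeightCol convention.

{code:java}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object WeightedEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("weightCol-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny illustrative prediction set: (prediction, label, weight).
    val predictions = Seq(
      (0.0, 0.0, 1.0),
      (1.0, 1.0, 2.0),
      (1.0, 0.0, 0.5)
    ).toDF("prediction", "label", "weight")

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setWeightCol("weight")      // weighting added on the Scala side by SPARK-24101
      .setMetricName("accuracy")

    // Rows with a larger weight contribute more to the metric.
    println(s"weighted accuracy = ${evaluator.evaluate(predictions)}")
    spark.stop()
  }
}
{code}

The PySpark work in this ticket would expose the same weightCol param through
the Python wrapper.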



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread shahid (JIRA)
shahid created SPARK-26186:
--

 Summary: In progress applications with last updated time is lesser 
than the cleaning interval are getting removed during cleaning logs
 Key: SPARK-26186
 URL: https://issues.apache.org/jira/browse/SPARK-26186
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 3.0.0
Reporter: shahid


In-progress applications whose last updated time is within the cleaning interval
are getting deleted.

 

Added a UT to test the scenario.

{code:java}
test("should not clean inprogress application with lastUpdated time less the maxTime") {
  val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
  val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
  val maxAge = TimeUnit.DAYS.toMillis(7)
  val clock = new ManualClock(0)
  val provider = new FsHistoryProvider(
    createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
  val log = newLogFile("inProgressApp1", None, inProgress = true)
  writeFile(log, true, None,
    SparkListenerApplicationStart(
      "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
  )
  clock.setTime(firstFileModifiedTime)
  provider.checkForLogs()

  writeFile(log, true, None,
    SparkListenerApplicationStart(
      "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
    SparkListenerJobStart(0, 1L, Nil, null)
  )
  clock.setTime(secondFileModifiedTime)
  provider.checkForLogs()

  clock.setTime(TimeUnit.DAYS.toMillis(10))
  writeFile(log, true, None,
    SparkListenerApplicationStart(
      "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
    SparkListenerJobStart(0, 1L, Nil, null),
    SparkListenerJobEnd(0, 1L, JobSucceeded)
  )
  provider.checkForLogs()

  // This should not trigger any cleanup
  updateAndCheck(provider) { list =>
    list.size should be(1)
  }
}
{code}
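
The behaviour the test pins down is that cleanup must key off each attempt's
last-updated timestamp rather than dropping in-progress attempts outright. A
minimal standalone sketch of that retention rule follows; shouldKeep is a
hypothetical helper used only for illustration, not the actual FsHistoryProvider
code.

{code:java}
import java.util.concurrent.TimeUnit

// Sketch of the retention rule the test above expects: an attempt becomes
// eligible for cleanup only once its last update is older than maxAge.
object CleanerRuleSketch {
  def shouldKeep(nowMs: Long, lastUpdatedMs: Long, maxAgeMs: Long): Boolean =
    nowMs - lastUpdatedMs < maxAgeMs

  def main(args: Array[String]): Unit = {
    val maxAge = TimeUnit.DAYS.toMillis(7)
    val now = TimeUnit.DAYS.toMillis(10)

    // Mirrors the test: the log was last written at day 10, so a cleanup run
    // at day 10 must keep it even though the application started on day 1.
    val lastUpdatedRecently = TimeUnit.DAYS.toMillis(10)
    println(shouldKeep(now, lastUpdatedRecently, maxAge))   // true  -> keep

    // A log untouched since day 1 is past the 7-day window and removable.
    val lastUpdatedLongAgo = TimeUnit.DAYS.toMillis(1)
    println(shouldKeep(now, lastUpdatedLongAgo, maxAge))    // false -> clean up
  }
}
{code}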




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Attachment: (was: Screenshot from 2018-11-27 13-21-34.png)

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, screenshot-1.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Description: 
For in-progress applications, the last updated time is not getting updated.

 !Screenshot from 2018-11-27 13-21-34.png! 


 !screenshot-1.png! 

  was:
For in-progress applications, the last updated time is not getting updated.





> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 13-21-34.png, screenshot-1.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 13-21-34.png! 
>  !screenshot-1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26185:


Assignee: Apache Spark

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> --https://issues.apache.org/jira/browse/SPARK-24101-- added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-26185:
---
Description: --https://issues.apache.org/jira/browse/SPARK-24101-- added 
weightCol in MulticlassClassificationEvaluator.scala. This Jira will add 
weightCol in python version of MulticlassClassificationEvaluator.  (was: 
-https://issues.apache.org/jira/browse/SPARK-24101- added weightCol in 
MulticlassClassificationEvaluator.scala. This Jira will add weightCol in python 
version of MulticlassClassificationEvaluator.)

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> --https://issues.apache.org/jira/browse/SPARK-24101-- added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700806#comment-16700806
 ] 

Apache Spark commented on SPARK-26185:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23157

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> --https://issues.apache.org/jira/browse/SPARK-24101-- added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-26185:
---
Description: -https://issues.apache.org/jira/browse/SPARK-24101- added 
weightCol in MulticlassClassificationEvaluator.scala. This Jira will add 
weightCol in python version of MulticlassClassificationEvaluator.  (was: 
https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in 
MulticlassClassificationEvaluator.scala. This Jira will add weightCol in python 
version of MulticlassClassificationEvaluator.)

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> -https://issues.apache.org/jira/browse/SPARK-24101- added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2018-11-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26185:


Assignee: (was: Apache Spark)

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> --https://issues.apache.org/jira/browse/SPARK-24101-- added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-27 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700804#comment-16700804
 ] 

shahid commented on SPARK-26186:


I will raise a PR.

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Priority: Major
>
> In-progress applications whose last updated time is within the cleaning
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the maxTime") {
>   val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
>   val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
>   val maxAge = TimeUnit.DAYS.toMillis(7)
>   val clock = new ManualClock(0)
>   val provider = new FsHistoryProvider(
>     createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
>   val log = newLogFile("inProgressApp1", None, inProgress = true)
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
>   )
>   clock.setTime(firstFileModifiedTime)
>   provider.checkForLogs()
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null)
>   )
>   clock.setTime(secondFileModifiedTime)
>   provider.checkForLogs()
>   clock.setTime(TimeUnit.DAYS.toMillis(10))
>   writeFile(log, true, None,
>     SparkListenerApplicationStart(
>       "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1")),
>     SparkListenerJobStart(0, 1L, Nil, null),
>     SparkListenerJobEnd(0, 1L, JobSucceeded)
>   )
>   provider.checkForLogs()
>   // This should not trigger any cleanup
>   updateAndCheck(provider) { list =>
>     list.size should be(1)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-27 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700799#comment-16700799
 ] 

Maxim Gekk commented on SPARK-23410:


> Even if lineSep is set, it is still necessary to identify the file's BOM 
> charset. 

For sure the encoding should be inferred from the BOM in any case. I just want 
to draw your attention to the case when lineSep is not set. We infer lineSep in 
UTF-8, and most likely should do the same for other encodings.

> In my opinion, we can try to read the first four bytes of the file on the 
> executor side to identify the encoding of the file. Because once the charset 
> of the file is determined, the charset of lineSep is also determined.

For JSON, we create the JacksonParser on the driver side before reading the 
file. For example: 
https://github.com/apache/spark/blob/e9af9460bc008106b670abac44a869721bfde42a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L105-L107
 . We need to take this into account.

[~x1q1j1] Please open a PR; I am ready to review it.

> I know BOM is only at the beginning of the file ...
[~hyukjin.kwon] I just double checked ;-)
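
For reference, the BOM probe discussed above only needs the first few bytes of
the stream. Below is a minimal standalone sketch covering the common UTF BOMs;
it is an illustration of the idea, not the code for the eventual PR, and the
case with no BOM is left to the configured or default encoding.

{code:java}
import java.io.{BufferedInputStream, FileInputStream}

// Illustrative BOM sniffing: read up to the first 4 bytes of a file and map a
// known byte-order mark to a charset name. Order matters: the UTF-32 marks
// must be checked before their UTF-16 prefixes.
object BomSniffer {
  def detect(path: String): Option[String] = {
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      val head = new Array[Byte](4)
      val n = in.read(head)
      val bytes = head.take(math.max(n, 0)).map(_ & 0xFF)
      bytes match {
        case Array(0x00, 0x00, 0xFE, 0xFF, _*) => Some("UTF-32BE")
        case Array(0xFF, 0xFE, 0x00, 0x00, _*) => Some("UTF-32LE")
        case Array(0xEF, 0xBB, 0xBF, _*)       => Some("UTF-8")
        case Array(0xFE, 0xFF, _*)             => Some("UTF-16BE")
        case Array(0xFF, 0xFE, _*)             => Some("UTF-16LE")
        case _                                 => None // no BOM: fall back to the configured encoding
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit =
    args.headOption.foreach { p =>
      println(s"$p -> ${detect(p).getOrElse("no BOM detected")}")
    }
}
{code}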



> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions 
> that can read json files in UTF-16, UTF-32 and other encodings due to using 
> of the auto detection mechanism of the jackson library. Need to give back to 
> users possibility to read json files in specified charset and/or detect 
> charset automatically as it was before.    



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Attachment: Screenshot from 2018-11-27 23-22-38.png

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Description: 
For in-progress applications, the last updated time is not getting updated.



 !Screenshot from 2018-11-27 23-20-11.png! 

 !Screenshot from 2018-11-27 23-22-38.png! 

  was:
For in-progress applications, the last updated time is not getting updated.


 !Screenshot from 2018-11-27 23-20-11.png! 


> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Attachment: (was: screenshot-1.png)

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Attachment: Screenshot from 2018-11-27 23-20-11.png

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, screenshot-1.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 13-21-34.png! 
>  !screenshot-1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-27 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26184:
---
Description: 
For in-progress applications, the last updated time is not getting updated.


 !Screenshot from 2018-11-27 23-20-11.png! 

  was:
For in-progress applications, the last updated time is not getting updated.

 !Screenshot from 2018-11-27 13-21-34.png! 


 !screenshot-1.png! 


> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-27 23-20-11.png, screenshot-1.png
>
>
> For in-progress applications, the last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


