[jira] [Resolved] (SPARK-29298) Separate block manager heartbeat endpoint from driver endpoint

2019-11-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29298.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25971
[https://github.com/apache/spark/pull/25971]

> Separate block manager heartbeat endpoint from driver endpoint
> --
>
> Key: SPARK-29298
> URL: https://issues.apache.org/jira/browse/SPARK-29298
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> The executor's heartbeat is sent synchronously to BlockManagerMaster to let it 
> know that the block manager is still alive. In a heavily loaded cluster, the 
> heartbeat can time out and cause the block manager to re-register unexpectedly.
> This improvement separates the heartbeat endpoint from the driver endpoint. 
> In our production environment, this is really helpful in preventing executors 
> from flapping up and down.
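
For illustration, a toy sketch of the design rationale in plain Scala (this is 
not Spark's internal RPC API; all names here are made up): when heartbeats share 
a single-threaded inbox with slow driver-side messages they queue behind them 
and time out, while a dedicated heartbeat inbox stays responsive.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object HeartbeatSketch extends App {
  val sharedPool     = Executors.newSingleThreadExecutor()
  val heartbeatPool  = Executors.newSingleThreadExecutor()
  val sharedInbox    = ExecutionContext.fromExecutor(sharedPool)
  val heartbeatInbox = ExecutionContext.fromExecutor(heartbeatPool)

  // A slow driver-side message occupies the shared inbox for five seconds.
  Future { Thread.sleep(5000) }(sharedInbox)

  // A heartbeat is considered lost if it is not answered within one second.
  def heartbeat(ec: ExecutionContext): Boolean =
    try Await.result(Future { true }(ec), 1.second)
    catch { case _: java.util.concurrent.TimeoutException => false }

  println(s"heartbeat via shared driver inbox: ${heartbeat(sharedInbox)}")    // false: timed out
  println(s"heartbeat via dedicated inbox:     ${heartbeat(heartbeatInbox)}") // true

  sharedPool.shutdownNow()
  heartbeatPool.shutdown()
}
{code}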






[jira] [Assigned] (SPARK-29298) Separate block manager heartbeat endpoint from driver endpoint

2019-11-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29298:
---

Assignee: Lantao Jin

> Separate block manager heartbeat endpoint from driver endpoint
> --
>
> Key: SPARK-29298
> URL: https://issues.apache.org/jira/browse/SPARK-29298
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> The executor's heartbeat is sent synchronously to BlockManagerMaster to let it 
> know that the block manager is still alive. In a heavily loaded cluster, the 
> heartbeat can time out and cause the block manager to re-register unexpectedly.
> This improvement separates the heartbeat endpoint from the driver endpoint. 
> In our production environment, this is really helpful in preventing executors 
> from flapping up and down.






[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-11-12 Thread zhao bo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972184#comment-16972184
 ] 

zhao bo commented on SPARK-29106:
-

Regarding the several comments I left above, we hope you could give us some 
advice, [~shaneknapp]. We plan to bring the new VM on board soon, but it still 
needs some testing, so we would like to keep the existing worker for a period 
and integrate the new VM at the same time.

Here is a summary of the current VM:

CPU: 8 vCPUs

Memory: 16 GB

Location: Beijing, China (this could affect network performance since the VM 
is in China, but we have improved the network performance by configuring an 
internal mirror source)

In our tests, the same Maven test run takes 5h33min and a build takes 43min, 
so I think it is good enough to serve as a Jenkins test worker. ;)

 

Thanks

 

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
>
> Add ARM test jobs to amplab Jenkins for Spark.
> So far we have set up two periodic ARM test jobs for Spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test on amplab Jenkins), 
> the other is based on a branch we cut on 09-09, see 
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>  and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only need to care about the first one when integrating the ARM test with 
> amplab Jenkins.
> As for the k8s test on ARM, we have already tried it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later.
> We also plan to test other stable branches and can integrate them into amplab 
> when they are ready.
> We have offered an ARM instance and sent the details to shane knapp; thanks 
> shane for adding the first ARM job to amplab Jenkins :) 
> The other important thing is leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  Spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  which has no arm64 support. So we built an arm64-capable release of 
> leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we cannot modify the Spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar for the ARM or x86 platform, 
> because Spark depends on some Hadoop packages such as hadoop-hdfs that also 
> depend on leveldbjni-all-1.8, unless Hadoop releases with a new 
> arm64-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 
> from openlabtesting and 'mvn install' it before running the ARM tests for 
> Spark.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]






[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972223#comment-16972223
 ] 

Hyukjin Kwon commented on SPARK-29764:
--

Priority doesn't matter for getting help; it is only related to release 
processes. Issues that are difficult to read or don't look easy to reproduce 
often don't get much attention.
If I were you, I would try to make a minimised reproducer without unrelated 
information, rather than just copying and pasting the whole code.
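
For illustration only, a minimal sketch of what such a reproducer might look 
like (the bean, the paths and the local master setting are assumptions, not 
taken from the report):

{code:scala}
import java.time.LocalDate
import scala.beans.BeanProperty
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}

// A tiny bean with a single java.time field, standing in for the reporter's Employee class.
class Event {
  @BeanProperty var id: String = _
  @BeanProperty var dob: LocalDate = _
}

object MinimalRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("repro").getOrCreate()
    val e = new Event
    e.setId("1")
    e.setDob(LocalDate.of(1999, 1, 1))
    // On 2.4.x the reporter sees a failure once a java.time field is part of the bean;
    // the exact step that fails may differ from this sketch.
    val ds = spark.createDataset(java.util.Arrays.asList(e))(Encoders.bean(classOf[Event]))
    ds.write.mode(SaveMode.Append).parquet("/tmp/repro.parquet")
    spark.stop()
  }
}
{code}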


> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for a data lake structure and analytics 
> using Apache Spark. 
>  When I add java.time.LocalDateTime/java.time.LocalDate properties to my data 
> model, the serialization to Parquet starts failing.
>  *My Data Model:*
> {code:java}
> @Data
> public class Employee {
>     private UUID id = UUID.randomUUID();
>     private String name;
>     private int age;
>     private LocalDate dob;
>     private LocalDateTime startDateTime;
>     private String phone;
>     private Address address;
> }
> {code}
>  *Serialization Snippet*
> {code:java}
> public void serialize() {
>     List<Employee> inputDataToSerialize = getInputDataToSerialize(); // creates 100,000 employee objects
>     Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
>     Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
>     employeeDataset.write()
>         .mode(SaveMode.Append)
>         .parquet("/Users/felix/Downloads/spark.parquet");
> }
> {code}
> *Exception Stack Trace:*
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at

[jira] [Commented] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread Enzo Bonnal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ] 

Enzo Bonnal commented on SPARK-29856:
-

Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you took this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
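
To make the point concrete, a small self-contained sketch in generic Scala (not 
the RandomForest code; names are illustrative): a persist only pays off once a 
second action reuses the RDD, which is exactly what a single loop iteration 
never does.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistPayoff {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("persist-demo").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for an expensive RDD such as baggedInput.
    val expensive = sc.parallelize(1 to 1000000)
      .map(i => math.sqrt(i.toDouble))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // With a single action the cached data is never reused, so the persist only adds overhead.
    println(expensive.sum())

    // A second action is what makes the cache worthwhile.
    println(expensive.count())

    expensive.unpersist()
    spark.stop()
  }
}
{code}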






[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread Enzo Bonnal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ] 

Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM:
---

Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you took this into account ?


was (Author: enzobnl):
Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you took this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread EnzoBnl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ] 

EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:42 AM:
---

Just a note:

If I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you taken this into account ?


was (Author: enzobnl):
Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you taken this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread EnzoBnl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ] 

EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:41 AM:
---

Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you taken this into account ?


was (Author: enzobnl):
Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you took this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread EnzoBnl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ] 

EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:47 AM:
---

If I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you taken this into account ?


was (Author: enzobnl):
Just a note:

If I am not wrong, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty._ Have you taken this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Commented] (SPARK-29855) typed literals with negative sign with proper result or exception

2019-11-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972292#comment-16972292
 ] 

Hyukjin Kwon commented on SPARK-29855:
--

[~Qin Yao] Please link the related JIRAs for the initial literal support.

> typed literals with negative sign with proper result or exception
> -
>
> Key: SPARK-29855
> URL: https://issues.apache.org/jira/browse/SPARK-29855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> -- !query 83
> select -integer '7'
> -- !query 83 schema
> struct<7:int>
> -- !query 83 output
> 7
> -- !query 86
> select -date '1999-01-01'
> -- !query 86 schema
> struct
> -- !query 86 output
> 1999-01-01
> -- !query 87
> select -timestamp '1999-01-01'
> -- !query 87 schema
> struct
> -- !query 87 output
> 1999-01-01 00:00:00
> {code}
> The integer result should be -7, and the date and timestamp results are 
> confusing; they should throw an exception instead.
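
A minimal sketch of how this can be observed from Scala (assuming a local 
SparkSession on a Spark 3.0.0 development build where these typed literals 
parse; the expected results in the comments simply restate the description):

{code:scala}
import org.apache.spark.sql.SparkSession

object NegativeTypedLiterals {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("typed-literals").getOrCreate()
    spark.sql("select -integer '7'").show()          // shows 7, although -7 is expected
    spark.sql("select -date '1999-01-01'").show()    // shows the date unchanged instead of failing
    spark.sql("select -timestamp '1999-01-01'").show()
    spark.stop()
  }
}
{code}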






[jira] [Commented] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread Dong Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972294#comment-16972294
 ] 

Dong Wang commented on SPARK-29856:
---

But there is only one action, _collectAsMap()_, using _baggedInput_. If the 
loop only executes once, _baggedInput_ is only used once.

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-12 Thread Dong Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972294#comment-16972294
 ] 

Dong Wang edited comment on SPARK-29856 at 11/12/19 11:32 AM:
--

But there is only one action, _collectAsMap()_, using _baggedInput_ in 
_findBestSplits_. If the loop only executes once, _baggedInput_ is only used 
once.


was (Author: spark_cachecheck):
But there is only one action _collectAsMap()_ using _baggedInput_. If the loop 
only execute once, _baggedInput_  is only used once.

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the 
> RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ 
> is persisted but only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while 
> loop.
> In GradientBoostedTreeRegressorExample, the loop executes only once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means 
> {color:#DE350B}_baggedInput_{color} is used by many actions, and then the 
> persist is necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.






[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2019-11-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972302#comment-16972302
 ] 

Hyukjin Kwon commented on SPARK-29854:
--

[~abhishek.akg] Can you please use a \{code\} ... \{code\} block?

> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark Returns Empty String)
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> Hive:
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL
> function lpad(unknown, numeric, unknown) does not exist
>  
> Expected output:
> In Spark also it should throw Exception like Hive
>  






[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2019-11-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972303#comment-16972303
 ] 

Hyukjin Kwon commented on SPARK-29854:
--

Other DBMSes have rpad and lpad too. Can you check?

> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark Returns Empty String)
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> {code}
> Hive:
> {code}
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> {code}
> PostgreSQL
> {code}
> function lpad(unknown, numeric, unknown) does not exist
> {code}
>  
> Expected output:
> In Spark also it should throw Exception like Hive
>  






[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29854:
-
Description: 
Spark Returns Empty String)

{code}
0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
lpad('hihhh', 5000, '');
 ++
|lpad(hihhh, CAST(5000 AS INT), 
)|

++


++
{code}

Hive:

{code}
SELECT lpad('hihhh', 5000, 
'');
 Error: Error while compiling statement: FAILED: SemanticException [Error 
10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
{code}

PostgreSQL

{code}
function lpad(unknown, numeric, unknown) does not exist
{code}

 

Expected output:

In Spark also it should throw Exception like Hive

 

  was:
Spark Returns Empty String)

0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
lpad('hihhh', 5000, '');
 ++
|lpad(hihhh, CAST(5000 AS INT), 
)|

++


++

Hive:

SELECT lpad('hihhh', 5000, 
'');
 Error: Error while compiling statement: FAILED: SemanticException [Error 
10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)

PostgreSQL

function lpad(unknown, numeric, unknown) does not exist

 

Expected output:

In Spark also it should throw Exception like Hive

 


> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark Returns Empty String)
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> {code}
> Hive:
> {code}
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> {code}
> PostgreSQL
> {code}
> function lpad(unknown, numeric, unknown) does not exist
> {code}
>  
> Expected output:
> In Spark also it should throw Exception like Hive
>  






[jira] [Updated] (SPARK-29771) Limit executor max failures before failing the application

2019-11-12 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-29771:
---
Description: 
ExecutorPodsAllocator does not limit the number of executor errors or 
deletions, which may cause executors to restart continuously without the 
application ever failing.
As a simple example, add {{--conf spark.executor.extraJavaOptions=-Xmse}} to 
spark-submit; this can make executors restart thousands of times without the 
application failing.

  was:
At present, K8S scheduling does not limit the number of failures of the 
executors, which may cause executor retried continuously without failing.
A simple example, we add a resource limit on default namespace. After the 
driver is started, if the quota is full, the executor will retry the creation 
continuously, resulting in a large amount of pod information accumulation. When 
many applications encounter such situations, they will affect the K8S cluster.


> Limit executor max failures before failing the application
> --
>
> Key: SPARK-29771
> URL: https://issues.apache.org/jira/browse/SPARK-29771
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Priority: Major
>
> ExecutorPodsAllocator does not limit the number of executor errors or 
> deletions, which may cause executors to restart continuously without the 
> application ever failing.
> As a simple example, add {{--conf 
> spark.executor.extraJavaOptions=-Xmse}} to spark-submit; this can make 
> executors restart thousands of times without the application failing.






[jira] [Issue Comment Deleted] (SPARK-29771) Limit executor max failures before failing the application

2019-11-12 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-29771:
---
Comment: was deleted

(was: This patch is mainly used in the scenario where the executor started 
failed. The executor runtime failure, which is caused by task errors is 
controlled by spark.executor.maxFailures.

Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after 
spark-submit, which can also appear executor crazy retry.)

> Limit executor max failures before failing the application
> --
>
> Key: SPARK-29771
> URL: https://issues.apache.org/jira/browse/SPARK-29771
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Priority: Major
>
> ExecutorPodsAllocator does not limit the number of executor errors or 
> deletions, which may cause executors to restart continuously without the 
> application ever failing.
> As a simple example, add {{--conf 
> spark.executor.extraJavaOptions=-Xmse}} to spark-submit; this can make 
> executors restart thousands of times without the application failing.






[jira] [Commented] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching

2019-11-12 Thread runzhiwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972345#comment-16972345
 ] 

runzhiwang commented on SPARK-29349:


[~juliuszsompolski] Hi, could you explain how to use this feature? I ask 
because hive-jdbc does not support it.

> Support FETCH_PRIOR in Thriftserver query results fetching
> --
>
> Key: SPARK-29349
> URL: https://issues.apache.org/jira/browse/SPARK-29349
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Support FETCH_PRIOR fetching in Thriftserver






[jira] [Commented] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching

2019-11-12 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972369#comment-16972369
 ] 

Juliusz Sompolski commented on SPARK-29349:
---

[~runzhiwang] it is used by Simba Spark ODBC driver to recover after a 
connection failure during fetching when AutoReconnect=1. It is available in 
Simba Spark ODBC driver v2.6.10 - I believe it is used only to recover the 
cursor position after reconnecting, not as a user facing feature to allow 
fetching backwards.

> Support FETCH_PRIOR in Thriftserver query results fetching
> --
>
> Key: SPARK-29349
> URL: https://issues.apache.org/jira/browse/SPARK-29349
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Support FETCH_PRIOR fetching in Thriftserver






[jira] [Created] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.

2019-11-12 Thread feiwang (Jira)
feiwang created SPARK-29857:
---

 Summary: [WEB UI] Support defer render the spark history summary 
page. 
 Key: SPARK-29857
 URL: https://issues.apache.org/jira/browse/SPARK-29857
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.4
Reporter: feiwang









[jira] [Commented] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-12 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972473#comment-16972473
 ] 

Aman Omer commented on SPARK-29823:
---

I will check this one.

> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the RDD {color:#de350b}_norms_{color} is 
> persisted. But {color:#de350b}_norms_{color} only has a single child, i.e., 
> the RDD {color:#de350b}_zippedData_{color}, which is not persisted. So all 
> the actions that rely on _norms_ actually rely on 
> {color:#de350b}_zippedData_{color}. The RDD 
> {color:#de350b}_zippedData_{color} is used multiple times in 
> _runAlgorithm()_. Therefore {color:#de350b}_zippedData_{color} should be 
> persisted, not {color:#de350b}_norms_{color}.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
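
A small self-contained sketch in generic Scala of the change the report 
suggests (the names and data here are illustrative, not the MLlib internals): 
persist the derived RDD that the algorithm reuses across actions, not its 
parent.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistTheReusedRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("kmeans-persist-demo").getOrCreate()
    val sc = spark.sparkContext

    val data  = sc.parallelize(1 to 100000).map(_.toDouble) // stand-in for the input vectors
    val norms = data.map(math.abs)                          // stand-in for the squared norms

    // The report's point: the zipped RDD is the one reused by later actions,
    // so it is the one worth persisting, not its parent `norms`.
    val zipped = data.zip(norms).persist(StorageLevel.MEMORY_AND_DISK)

    // Several actions over `zipped`, as runAlgorithm would issue.
    println(zipped.count())
    println(zipped.map(_._2).sum())

    zipped.unpersist()
    spark.stop()
  }
}
{code}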






[jira] [Assigned] (SPARK-29855) typed literals with negative sign with proper result or exception

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29855:


Assignee: Kent Yao

> typed literals with negative sign with proper result or exception
> -
>
> Key: SPARK-29855
> URL: https://issues.apache.org/jira/browse/SPARK-29855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> {code:java}
> -- !query 83
> select -integer '7'
> -- !query 83 schema
> struct<7:int>
> -- !query 83 output
> 7
> -- !query 86
> select -date '1999-01-01'
> -- !query 86 schema
> struct
> -- !query 86 output
> 1999-01-01
> -- !query 87
> select -timestamp '1999-01-01'
> -- !query 87 schema
> struct
> -- !query 87 output
> 1999-01-01 00:00:00
> {code}
> The integer result should be -7, and the date and timestamp results are 
> confusing; they should throw an exception instead.






[jira] [Resolved] (SPARK-29855) typed literals with negative sign with proper result or exception

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29855.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26479
[https://github.com/apache/spark/pull/26479]

> typed literals with negative sign with proper result or exception
> -
>
> Key: SPARK-29855
> URL: https://issues.apache.org/jira/browse/SPARK-29855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> -- !query 83
> select -integer '7'
> -- !query 83 schema
> struct<7:int>
> -- !query 83 output
> 7
> -- !query 86
> select -date '1999-01-01'
> -- !query 86 schema
> struct
> -- !query 86 output
> 1999-01-01
> -- !query 87
> select -timestamp '1999-01-01'
> -- !query 87 schema
> struct
> -- !query 87 output
> 1999-01-01 00:00:00
> {code}
> The integer result should be -7, and the date and timestamp results are 
> confusing; they should throw an exception instead.






[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2019-11-12 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-29854:
-
Description: 
Spark Returns Empty String)

{code}
0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
lpad('hihhh', 5000, '');
 ++
|lpad(hihhh, CAST(5000 AS INT), 
)|

++


++


Hive:


SELECT lpad('hihhh', 5000, 
'');
 Error: Error while compiling statement: FAILED: SemanticException [Error 
10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)


PostgreSQL


function lpad(unknown, numeric, unknown) does not exist


 

Expected output:

In Spark also it should throw Exception like Hive
{code}
 

  was:
Spark Returns Empty String)

{code}
0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
lpad('hihhh', 5000, '');
 ++
|lpad(hihhh, CAST(5000 AS INT), 
)|

++


++
{code}

Hive:

{code}
SELECT lpad('hihhh', 5000, 
'');
 Error: Error while compiling statement: FAILED: SemanticException [Error 
10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
{code}

PostgreSQL

{code}
function lpad(unknown, numeric, unknown) does not exist
{code}

 

Expected output:

In Spark also it should throw Exception like Hive

 


> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark Returns Empty String)
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> Hive:
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL
> function lpad(unknown, numeric, unknown) does not exist
>  
> Expected output:
> In Spark also it should throw Exception like Hive
> {code}
>  






[jira] [Commented] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly

2019-11-12 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972515#comment-16972515
 ] 

Gabor Somogyi commented on SPARK-29710:
---

[~BdLearner] bump, can you provide any useful information like logs? Otherwise 
I'm going to close this as incomplete.

> Seeing offsets not resetting even when reset policy is configured explicitly
> 
>
> Key: SPARK-29710
> URL: https://issues.apache.org/jira/browse/SPARK-29710
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: Window10 , eclipse neos
>Reporter: Shyam
>Priority: Major
>
>  
> Even after setting *"auto.offset.reset" to "latest"*, I am getting the below 
> error:
>  
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
> range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
>  Offsets out of range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
>  at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) 
> at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
>  
> [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang]
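
Since the stack trace points at the Structured Streaming Kafka source, a hedged 
sketch of the source options that usually govern starting positions and 
out-of-range offsets there (the broker address is a placeholder, and whether 
this applies to the reporter's setup still needs to be confirmed):

{code:scala}
import org.apache.spark.sql.SparkSession

object KafkaSourceOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("kafka-options").getOrCreate()

    // The Kafka source manages offsets itself; starting positions and lost/out-of-range
    // offsets are controlled by these source options rather than the consumer config.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
      .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")   // topic name from the report
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")                     // log and continue instead of failing
      .load()

    val query = df.writeStream.format("console").start()
    query.awaitTermination(60000)
    spark.stop()
  }
}
{code}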






[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2019-11-12 Thread Julian Eberius (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972524#comment-16972524
 ] 

Julian Eberius commented on SPARK-20050:


Sorry to comment on an old, closed issue, but what is the solution to this 
problem? I have noticed the same issue: the offsets of the last batch of a 
Spark/Kafka streaming app are never committed to Kafka, as commitAll() is only 
called in compute(). In effect, offsets committed by the user via the 
CanCommitOffsets interface are actually only ever committed in the following 
batch. If there is no following batch, nothing is committed. Therefore, on the 
next streaming job instance, duplicate processing will occur. The GitHub PR 
therefore makes a lot of sense to me.

Is there anything that I'm missing? What is the recommended way to commit 
offsets into Kafka on Spark Streaming application shutdown?
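
As an illustration of the behaviour described above, a hedged, self-contained 
sketch (broker address, topic and timings are placeholders, not taken from this 
thread) that records which offsets the commitAsync callback actually 
acknowledges, making the one-batch lag visible after a graceful stop:

{code:scala}
import java.{util => ju}
import java.util.concurrent.atomic.AtomicReference
import org.apache.kafka.clients.consumer.{ConsumerConfig, OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object CommitVisibility {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("commit-demo"), Seconds(5))
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",   // placeholder broker
      ConsumerConfig.GROUP_ID_CONFIG -> "commit-demo",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean))
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    // Holds the offsets most recently acknowledged by the broker.
    val lastAcked = new AtomicReference[ju.Map[TopicPartition, OffsetAndMetadata]]()

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges, new OffsetCommitCallback {
        override def onComplete(offsets: ju.Map[TopicPartition, OffsetAndMetadata],
                                e: Exception): Unit =
          if (e == null) lastAcked.set(offsets) // fires only when the commit is actually issued
      })
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(60000)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    // After a graceful stop, lastAcked typically lags one batch behind what was processed,
    // which is the behaviour discussed in this issue.
    println(s"last acknowledged commits: ${lastAcked.get()}")
  }
}
{code}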

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>Priority: Major
>
> I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
> and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below:
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString +
>     " offset: " + input.offset.toString
> }.foreachRDD { rdd =>
>   rdd.foreach { input =>
>     println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records that were processed in the last batch before a graceful shutdown 
> of Spark Streaming are reprocessed in the first batch after Spark Streaming 
> restarts, as below:
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> The cause may be that offsets passed to commitAsync are only committed at the 
> head of the next batch.






[jira] [Created] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands

2019-11-12 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-29858:
-

 Summary: ALTER DATABASE (SET DBPROPERTIES) should look up catalog 
like v2 commands
 Key: SPARK-29858
 URL: https://issues.apache.org/jira/browse/SPARK-29858
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands

2019-11-12 Thread Hu Fuwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972530#comment-16972530
 ] 

Hu Fuwang commented on SPARK-29858:
---

Working on this.

> ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
> -
>
> Key: SPARK-29858
> URL: https://issues.apache.org/jira/browse/SPARK-29858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29859) ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands

2019-11-12 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-29859:
-

 Summary: ALTER DATABASE (SET LOCATION) should look up catalog like 
v2 commands
 Key: SPARK-29859
 URL: https://issues.apache.org/jira/browse/SPARK-29859
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29859) ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands

2019-11-12 Thread Hu Fuwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972532#comment-16972532
 ] 

Hu Fuwang commented on SPARK-29859:
---

Working on this.

> ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands
> -
>
> Key: SPARK-29859
> URL: https://issues.apache.org/jira/browse/SPARK-29859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-11-12 Thread feiwang (Jira)
feiwang created SPARK-29860:
---

 Summary: [SQL] Fix data type mismatch issue for inSubQuery
 Key: SPARK-29860
 URL: https://issues.apache.org/jira/browse/SPARK-29860
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29861) Reduce leader election downtime in Spark standalone HA

2019-11-12 Thread Robin Wolters (Jira)
Robin Wolters created SPARK-29861:
-

 Summary: Reduce leader election downtime in Spark standalone HA
 Key: SPARK-29861
 URL: https://issues.apache.org/jira/browse/SPARK-29861
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Robin Wolters


As officially stated in the spark [HA 
documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper],
 the recovery process of the Spark (standalone) master in HA with zookeeper takes 
about 1-2 minutes. During this time no spark master is active, which makes 
interaction with spark essentially impossible. 

After looking for a way to reduce this downtime, it seems that it is mainly 
caused by the leader election, which waits for open zookeeper connections to be 
closed. This is unnecessary downtime, for example in the case of a 
planned VM update.

I have fixed this in my setup by:
 # Closing open zookeeper connections during spark shutdown
 # Bumping the curator version and implementing a custom error policy that is 
tolerant to a zookeeper connection suspension.

I am preparing a pull request for review / further discussion on this issue.
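A rough sketch of point 1 above, under the assumption that the master's
CuratorFramework client (here called `zk`) is reachable from the shutdown path;
the exact wiring is not part of this ticket and the real change will be in the
pull request:

{code:scala}
import org.apache.curator.framework.CuratorFramework

// Hypothetical helper: closing the Curator client on shutdown releases the
// ephemeral leader-election znodes immediately, so a standby master can win
// the election without waiting for the ZooKeeper session timeout.
def closeZooKeeperOnShutdown(zk: CuratorFramework): Unit = {
  if (zk != null) {
    zk.close()
  }
}
{code}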

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-11-12 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29860:

Description: 
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}


{code:java}
// Some comments here
public String getFoo()
{
return foo;
}
{code}



  was:
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}

cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(19,0)].;;
'Project [*]
+- 'Filter id#219 IN (list#218 [])
   :  +- Project [id#220]
   : +- SubqueryAlias `default`.`tb`
   :+- Relation[id#220] parquet
   +- SubqueryAlias `default`.`ta`
  +- Relation[id#219] parquet



> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").show()
> {code}
> {code:java}
> // Some comments here
> public String getFoo()
> {
> return foo;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-11-12 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29860:

Description: 
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}


{code:java}
// Some comments here
cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(19,0)].;;
'Project [*]
+- 'Filter id#219 IN (list#218 [])
   :  +- Project [id#220]
   : +- SubqueryAlias `default`.`tb`
   :+- Relation[id#220] parquet
   +- SubqueryAlias `default`.`ta`
  +- Relation[id#219] parquet
{code}



  was:
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}


{code:java}
// Some comments here
public String getFoo()
{
return foo;
}
{code}




> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").show()
> {code}
> {code:java}
> // Some comments here
> cannot resolve '(default.ta.`id` IN (listquery()))' due to data type 
> mismatch: 
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
> Left side:
> [decimal(18,0)].
> Right side:
> [decimal(19,0)].;;
> 'Project [*]
> +- 'Filter id#219 IN (list#218 [])
>:  +- Project [id#220]
>: +- SubqueryAlias `default`.`tb`
>:+- Relation[id#220] parquet
>+- SubqueryAlias `default`.`ta`
>   +- Relation[id#219] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-11-12 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29860:

Description: 
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}

cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(19,0)].;;
'Project [*]
+- 'Filter id#219 IN (list#218 [])
   :  +- Project [id#220]
   : +- SubqueryAlias `default`.`tb`
   :+- Relation[id#220] parquet
   +- SubqueryAlias `default`.`ta`
  +- Relation[id#219] parquet


> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").show()
> {code}
> cannot resolve '(default.ta.`id` IN (listquery()))' due to data type 
> mismatch: 
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
> Left side:
> [decimal(18,0)].
> Right side:
> [decimal(19,0)].;;
> 'Project [*]
> +- 'Filter id#219 IN (list#218 [])
>:  +- Project [id#220]
>: +- SubqueryAlias `default`.`tb`
>:+- Relation[id#220] parquet
>+- SubqueryAlias `default`.`ta`
>   +- Relation[id#219] parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery

2019-11-12 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29860:

Description: 
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}


{code:java}
// Exception information
cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(19,0)].;;
'Project [*]
+- 'Filter id#219 IN (list#218 [])
   :  +- Project [id#220]
   : +- SubqueryAlias `default`.`tb`
   :+- Relation[id#220] parquet
   +- SubqueryAlias `default`.`ta`
  +- Relation[id#219] parquet
{code}



  was:
The following statement throws an exception.


{code:java}
  sql("create table ta(id Decimal(18,0)) using parquet")
  sql("create table tb(id Decimal(19,0)) using parquet")
  sql("select * from ta where id in (select id from tb)").show()
{code}


{code:java}
// Some comments here
cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(19,0)].;;
'Project [*]
+- 'Filter id#219 IN (list#218 [])
   :  +- Project [id#220]
   : +- SubqueryAlias `default`.`tb`
   :+- Relation[id#220] parquet
   +- SubqueryAlias `default`.`ta`
  +- Relation[id#219] parquet
{code}




> [SQL] Fix data type mismatch issue for inSubQuery
> -
>
> Key: SPARK-29860
> URL: https://issues.apache.org/jira/browse/SPARK-29860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> The following statement throws an exception.
> {code:java}
>   sql("create table ta(id Decimal(18,0)) using parquet")
>   sql("create table tb(id Decimal(19,0)) using parquet")
>   sql("select * from ta where id in (select id from tb)").show()
> {code}
> {code:java}
> // Exception information
> cannot resolve '(default.ta.`id` IN (listquery()))' due to data type 
> mismatch: 
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))]
> Left side:
> [decimal(18,0)].
> Right side:
> [decimal(19,0)].;;
> 'Project [*]
> +- 'Filter id#219 IN (list#218 [])
>:  +- Project [id#220]
>: +- SubqueryAlias `default`.`tb`
>:+- Relation[id#220] parquet
>+- SubqueryAlias `default`.`ta`
>   +- Relation[id#219] parquet
> {code}
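As an aside (not part of the ticket), a possible workaround sketch until the
mismatch itself is handled is to widen the narrower decimal explicitly; table
and column names are taken from the reproduction above, and whether the extra
cast is acceptable depends on the precision semantics the user needs:

{code:scala}
// Reproduction tables from the description
spark.sql("create table ta(id Decimal(18,0)) using parquet")
spark.sql("create table tb(id Decimal(19,0)) using parquet")

// Casting ta.id to decimal(19,0) makes both sides of the IN subquery the same
// type, which may avoid the "data type mismatch" analysis error shown above.
spark.sql("select * from ta where cast(id as decimal(19,0)) in (select id from tb)").show()
{code}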



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29862) CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands

2019-11-12 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972560#comment-16972560
 ] 

Huaxin Gao commented on SPARK-29862:


I will work on this. 

> CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29862
> URL: https://issues.apache.org/jira/browse/SPARK-29862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29862) CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands

2019-11-12 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-29862:
--

 Summary: CREATE (OR REPLACE) ... VIEW should look up catalog/table 
like v2 commands
 Key: SPARK-29862
 URL: https://issues.apache.org/jira/browse/SPARK-29862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.

2019-11-12 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29857:

Description: 
When there are many applications in the spark history server, rendering the 
history summary page is heavy; we can enable deferRender to improve it.

See https://datatables.net/examples/ajax/defer_render.html for details.

> [WEB UI] Support defer render the spark history summary page. 
> --
>
> Key: SPARK-29857
> URL: https://issues.apache.org/jira/browse/SPARK-29857
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Minor
>
> When there are many applications in the spark history server, rendering the 
> history summary page is heavy; we can enable deferRender to improve it.
> See https://datatables.net/examples/ajax/defer_render.html for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-12 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972605#comment-16972605
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

So it's still hard to follow the reproducer? Do I need to skim it further?

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add java.time.LocalDateTime/java.time.LocalDate properties to 
> my data model, the serialization to Parquet starts failing.
>  *My Data Model:*
> @Data
> public class Employee {
> private UUID id = UUID.randomUUID();
> private String name;
> private int age;
> private LocalDate dob;
> private LocalDateTime startDateTime;
> private String phone;
> private Address address;
> }
>  
>  *Serialization Snippet*
> public void serialize() {
> List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects
> Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
> Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
> employeeDataset.write()
> .mode(SaveMode.Append)
> .parquet("/Users/felix/Downloads/spark.parquet");
> }
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 

[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1

2019-11-12 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972631#comment-16972631
 ] 

Imran Rashid commented on SPARK-29762:
--

I don't really understand the complication.  I know there would be some special 
casing for GPUs in the config parsing code (eg. in 
{{org.apache.spark.resource.ResourceUtils#parseResourceRequirements}}), but 
doesn't seem anything too bad.

I did think about this more, and realized it gets a bit confusing when you add 
in task-level resource constraints: you won't schedule optimally for tasks 
that don't need a gpu, and you won't have gpus left over for the tasks that do 
need them. E.g., say you had each executor set up with 4 cores and 2 gpus. If 
you had one task set come in which only needed cpu, you would only run 2 
copies. And then if another taskset came in which did need the gpus, you 
wouldn't be able to schedule it.

You can't end up in that situation until you have task-specific resource 
constraints.  But does it get too messy to have sensible defaults in that 
situation?  Maybe the user specifies gpus as an executor resource up front, for 
the whole cluster, because they have them available and they know some 
significant fraction of the workloads need them.  They might think that the 
regular tasks will just ignore the gpus, and the tasks that do need gpus would 
just specify them as task-level constraints.

I guess this might have been a bad suggestion after all, sorry.

> GPU Scheduling - default task resource amount to 1
> --
>
> Key: SPARK-29762
> URL: https://issues.apache.org/jira/browse/SPARK-29762
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Default the task level resource configs (for gpu/fpga, etc) to 1.  So if the 
> user specifies the executor resource then to make it more user friendly lets 
> have the task resource config default to 1.  This is ok right now since we 
> require resources to have an address.  It also matches what we do for the 
> spark.task.cpus configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29850) sort-merge-join an empty table should not memory leak

2019-11-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29850.
-
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26471
[https://github.com/apache/spark/pull/26471]

> sort-merge-join an empty table should not memory leak
> -
>
> Key: SPARK-29850
> URL: https://issues.apache.org/jira/browse/SPARK-29850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29863) rename EveryAgg/AnyAgg to BoolAnd/BoolOr

2019-11-12 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29863:
---

 Summary: rename EveryAgg/AnyAgg to BoolAnd/BoolOr
 Key: SPARK-29863
 URL: https://issues.apache.org/jira/browse/SPARK-29863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1

2019-11-12 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972661#comment-16972661
 ] 

Thomas Graves commented on SPARK-29762:
---

The code that gets the task requirements is generic and is reused by more than 
just tasks. Also, there is code that loops over the task requirements and assumes 
all task requirements are there, but you can't do that if you are relying on a 
default value because they won't be present. At some point you either have to 
look at all executor requirements and then build up the task requirements that 
include the default ones, or you just have to change those loops to be over the 
executor requirements.

I'm not saying it's not possible; my comment was just saying it isn't as 
straightforward as I originally thought. 

I don't think you have the issue you are talking about, because you have to have 
the task requirements be based on the executor resource requests (which is part 
of the complication). As long as you do that with the stage-level scheduling I 
think it's ok: ResourceProfile ExecutorRequest = 1 CPU, 1 GPU, user-defined 
taskRequest = null, then translate into taskRequest = 1 CPU, 1 GPU. 

 

 

> GPU Scheduling - default task resource amount to 1
> --
>
> Key: SPARK-29762
> URL: https://issues.apache.org/jira/browse/SPARK-29762
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Default the task level resource configs (for gpu/fpga, etc) to 1.  So if the 
> user specifies the executor resource then to make it more user friendly lets 
> have the task resource config default to 1.  This is ok right now since we 
> require resources to have an address.  It also matches what we do for the 
> spark.task.cpus configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29864) Strict parsing of day-time strings to intervals

2019-11-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29864:
--

 Summary: Strict parsing of day-time strings to intervals
 Key: SPARK-29864
 URL: https://issues.apache.org/jira/browse/SPARK-29864
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, the IntervalUtils.fromDayTimeString() method does not take the left 
bound `from` into account and truncates the result using only the right bound 
`to`. The method should respect the bounds specified by the user.

Oracle and MySQL respect the user's bounds, see 
https://github.com/apache/spark/pull/26358#issuecomment-551942719 and 
https://github.com/apache/spark/pull/26358#issuecomment-549272475 
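For illustration only (not from the ticket), the day-time interval syntax whose
`from`/`to` bounds are in question looks like this; the exact results after the
change are not asserted here:

{code:scala}
// Day-time interval literals with explicit bounds; this ticket is about making
// the parser respect the DAY TO HOUR / DAY TO SECOND bounds instead of ignoring
// the left bound and truncating only by the right one.
spark.sql("SELECT INTERVAL '10 11:12:13.123' DAY TO SECOND").show()
spark.sql("SELECT INTERVAL '10 11' DAY TO HOUR").show()
{code}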



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-11-12 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972697#comment-16972697
 ] 

Shane Knapp commented on SPARK-29106:
-

i would definitely be happy to have a faster VM available for testing...  5h33m 
is a major improvement over 10h.  :)

at the same time, i'm working w/my lead sysadmin and we will most likely be 
purchasing an actual aarch64/ARM server for our build system.  the ETA for this 
will be most likely in early 2020, so we can start testing on the new VM as 
soon as you're ready.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the 
> other is based on a new branch which we made on 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the arm test with amplab 
> jenkins.
> About the k8s test on arm, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later. 
> And we plan to test other stable branches too, and we can integrate them into 
> amplab when they are ready.
> We have offered an arm instance and sent the infos to shane knapp, thanks 
> shane to add the first arm job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 support. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar package on the arm or x86 platform, 
> because spark depends on some hadoop packages like hadoop-hdfs, and those packages 
> depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting 
> leveldbjni jar. For now we download the leveldbjni-all-1.8 of 
> openlabtesting and 'mvn install' it to use it when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1

2019-11-12 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972703#comment-16972703
 ] 

Imran Rashid commented on SPARK-29762:
--

The scenario I was thinking of is:

An application is submitted with app-wide resource requirements of 4 cores & 2 
gpus per executor, but none for tasks.  (I guess they specify 
"spark.executor.resource.gpu" but not "spark.task.resource.gpu".)

One taskset is submitted with no task resource requirements, because it doesn't 
need any gpus.

Another taskset is submitted which does require 1 cpu & 1 gpu per task.


The user does it this way because they don't want all of their tasks to use 
gpus, but they know enough of them will need gpus that they'd rather just 
request gpus upfront than scale two different types of executors up & down.
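To make the scenario concrete, a hedged sketch of the configuration being
described, using the resource config keys from the 3.0 work (the discovery
script path is a placeholder):

{code:scala}
import org.apache.spark.SparkConf

// Executor-level GPUs requested app-wide; the task-level amount is the value
// this ticket proposes to default to 1 when the user leaves it unset.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.resource.gpu.amount", "2")
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/getGpus.sh") // placeholder path
  .set("spark.task.resource.gpu.amount", "1")
{code}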

> GPU Scheduling - default task resource amount to 1
> --
>
> Key: SPARK-29762
> URL: https://issues.apache.org/jira/browse/SPARK-29762
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Default the task level resource configs (for gpu/fpga, etc) to 1.  So if the 
> user specifies the executor resource then to make it more user friendly lets 
> have the task resource config default to 1.  This is ok right now since we 
> require resources to have an address.  It also matches what we do for the 
> spark.task.cpus configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29487) Ability to run Spark Kubernetes other than from /opt/spark

2019-11-12 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29487.

Resolution: Duplicate

This is basically allowing people to customize their docker images, which is 
SPARK-24655.

> Ability to run Spark Kubernetes other than from /opt/spark
> --
>
> Key: SPARK-29487
> URL: https://issues.apache.org/jira/browse/SPARK-29487
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.4.4
>Reporter: Benjamin Miao CAI
>Priority: Minor
>
> In the spark kubernetes Dockerfile, the spark binaries are copied to 
> */opt/spark.* 
> If we try to create our own Dockerfile without using */opt/spark*, then the 
> image will not run.
> After looking at the source code, it seems that in various places the path is 
> hard-coded to */opt/spark*
> *Example :*
> Constants.scala :
> // Spark app configs for containers
> val SPARK_CONF_VOLUME = "spark-conf-volume"
> *val SPARK_CONF_DIR_INTERNAL = "/opt/spark/conf"*
>  
> Is it possible to make this configurable so we can put spark somewhere other 
> than /opt/?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29865) k8s executor pods all have different prefixes in client mode

2019-11-12 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29865:
--

 Summary: k8s executor pods all have different prefixes in client 
mode
 Key: SPARK-29865
 URL: https://issues.apache.org/jira/browse/SPARK-29865
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


This works in cluster mode, since the feature steps set things up so that all 
executor pods have the same name prefix.

But in client mode the feature steps are not used, so each executor ends up with a 
different name prefix, which makes debugging a little bit annoying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29866) Upper case enum values

2019-11-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29866:
--

 Summary: Upper case enum values
 Key: SPARK-29866
 URL: https://issues.apache.org/jira/browse/SPARK-29866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


Unify naming of enum values and upper case their names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow

2019-11-12 Thread Isaac Myers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972725#comment-16972725
 ] 

Isaac Myers commented on SPARK-29797:
-

To be clear, Arrow does write the metadata because it can be retrieved per the 
example code (attached). The issue lies with Spark's ability to read it. It 
does not need to be associated with the df, though 'df.file_metadata()' would 
be fine. Otherwise, maybe 'spark.read.parquet_metadata()'.
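As a hedged workaround sketch (not a Spark API), the file-level key-value
metadata can be read with the parquet-hadoop classes that ship with Spark;
method names here are from parquet-mr, and the file name matches the attached
example:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val inputFile = HadoopInputFile.fromPath(new Path("test.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  // java.util.Map[String, String] with the file-level key-value metadata,
  // e.g. the "classification" entry written by the Arrow example
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  kv.asScala.foreach { case (k, v) => println(s"$k -> $v") }
} finally {
  reader.close()
}
{code}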

> Read key-value metadata in Parquet files written by Apache Arrow
> 
>
> Key: SPARK-29797
> URL: https://issues.apache.org/jira/browse/SPARK-29797
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, PySpark
>Affects Versions: 2.4.4
> Environment: Apache Arrow 0.14.1 built on Windows x86. 
>  
>Reporter: Isaac Myers
>Priority: Major
>  Labels: features
> Attachments: minimal_working_example.cpp
>
>
> Key-value (user) metadata written to Parquet file from Apache Arrow c++ is 
> not readable in Spark (PySpark or Java API). I can only find field-level 
> metadata dictionaries in the schema and no other functions in the API that 
> indicate the presence of file-level key-value metadata. The attached code 
> demonstrates creation and retrieval of file-level metadata using the Apache 
> Arrow API.
> {code:java}
> #include #include #include #include #include 
> #include #include 
> #include #include 
> //#include 
> int main(int argc, char* argv[]){ /* Create 
> Parquet File **/ arrow::Status st; 
> arrow::MemoryPool* pool = arrow::default_memory_pool();
>  // Create Schema and fields with metadata 
> std::vector> fields;
>  std::unordered_map a_keyval; a_keyval["unit"] = 
> "sec"; a_keyval["note"] = "not the standard millisecond unit"; 
> arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field 
> = arrow::field("a", arrow::int16(), false, a_md.Copy()); 
> fields.push_back(a_field);
>  std::unordered_map b_keyval; b_keyval["unit"] = 
> "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr 
> b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); 
> fields.push_back(b_field);
>  std::shared_ptr schema = arrow::schema(fields);
>  // Add metadata to schema. std::unordered_map 
> schema_keyval; schema_keyval["classification"] = "Type 0"; 
> arrow::KeyValueMetadata schema_md(schema_keyval); schema = 
> schema->AddMetadata(schema_md.Copy());
>  // Build arrays of data and add to Table. const int64_t rowgroup_size = 100; 
> std::vector a_data(rowgroup_size, 0); std::vector 
> b_data(rowgroup_size, 0);
>  for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = 
> rowgroup_size - i; }  arrow::Int16Builder a_bldr(pool); arrow::Int16Builder 
> b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = 
> b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1;
>  st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1;
>  st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1;
>  std::shared_ptr a_arr_ptr; std::shared_ptr 
> b_arr_ptr;
>  arrow::ArrayVector arr_vec; st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) 
> return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(&b_arr_ptr); if 
> (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr);
>  std::shared_ptr table = arrow::Table::Make(schema, arr_vec);
>  // Test metadata printf("\nMetadata from original schema:\n"); 
> printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", 
> schema->field(1)->metadata()->ToString().c_str());
>  std::shared_ptr table_schema = table->schema(); 
> printf("\nMetadata from schema retrieved from table (should be the 
> same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); 
> printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str());
>  // Open file and write table. std::string file_name = "test.parquet"; 
> std::shared_ptr ostream; st = 
> arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return 
> 1;
>  std::unique_ptr writer; 
> std::shared_ptr props = 
> parquet::default_writer_properties(); st = 
> parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if 
> (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if 
> (!st.ok()) return 1;
>  // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = 
> ostream->Close(); if (!st.ok()) return 1;
>  /* Read Parquet File 
> **/
>  // Create new memory pool. Not sure if this is necessary. 
> //arrow::MemoryPool* pool

[jira] [Resolved] (SPARK-29789) should not parse the bucket column name again when creating v2 tables

2019-11-12 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-29789.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> should not parse the bucket column name again when creating v2 tables
> -
>
> Key: SPARK-29789
> URL: https://issues.apache.org/jira/browse/SPARK-29789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1

2019-11-12 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972750#comment-16972750
 ] 

Thomas Graves commented on SPARK-29762:
---

Ok, in the scenario above you would need 2 different ResourceProfiles, and they 
would not use the same containers: 1 profile would need executors with only 
cpu, the other profile would need executors with cpus and gpus. At least that 
is how it would be in this initial stage-level scheduling proposal. Anything 
more complex could come later. It does bring up a good point that my base pr 
doesn't have the checks to ensure that, though, but that is separate.

 

 

 

 

> GPU Scheduling - default task resource amount to 1
> --
>
> Key: SPARK-29762
> URL: https://issues.apache.org/jira/browse/SPARK-29762
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Default the task level resource configs (for gpu/fpga, etc) to 1.  So if the 
> user specifies the executor resource then to make it more user friendly lets 
> have the task resource config default to 1.  This is ok right now since we 
> require resources to have an address.  It also matches what we do for the 
> spark.task.cpus configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1

2019-11-12 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972755#comment-16972755
 ] 

Thomas Graves commented on SPARK-29762:
---

But like you mention, if we were to support something like that, then defaulting 
to 1 when the task requirement isn't specified would be an issue.  

> GPU Scheduling - default task resource amount to 1
> --
>
> Key: SPARK-29762
> URL: https://issues.apache.org/jira/browse/SPARK-29762
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Default the task level resource configs (for gpu/fpga, etc) to 1.  So if the 
> user specifies the executor resource then to make it more user friendly lets 
> have the task resource config default to 1.  This is ok right now since we 
> require resources to have an address.  It also matches what we do for the 
> spark.task.cpus configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29867) add __repr__ in Python ML Models

2019-11-12 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-29867:
--

 Summary: add __repr__ in Python ML Models
 Key: SPARK-29867
 URL: https://issues.apache.org/jira/browse/SPARK-29867
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: Huaxin Gao


In Python ML Models, some of them have __repr__ and others don't. In the doctests, 
when calling Model.setXXX, some of the Models print out the xxxModel... output 
correctly; others can't because they lack the __repr__ method. This 
Jira addresses that issue. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29867) add __repr__ in Python ML Models

2019-11-12 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-29867:
---
Description: In Python ML Models, some of them have __repr__, others 
don't. In the doctest, when calling Model.setXXX, some of the Models print out 
the xxxModel... correctly, some of them can't because of lacking the  
__repr__ method. This Jira addresses this issue.   (was: In Python ML 
Models, some of them have __repr__, others don't. In the doctest, when calling 
Model.setXXX, some of the Models print out the xxxModel... correctly, some of 
them can't because of lacking the  __repr__ method. This Jira addresses this 
issue. )

> add __repr__ in Python ML Models
> 
>
> Key: SPARK-29867
> URL: https://issues.apache.org/jira/browse/SPARK-29867
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> In Python ML Models, some of them have __repr__, others don't. In the 
> doctest, when calling Model.setXXX, some of the Models print out the 
> xxxModel... correctly, some of them can't because of lacking the  __repr__ 
> method. This Jira addresses this issue. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java

2019-11-12 Thread Jalpan Randeri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972809#comment-16972809
 ] 

Jalpan Randeri commented on SPARK-29493:


Hi Bryan, 

Are you working on this issue? If not, I want to raise a pull request for it.

> Add MapType support for Arrow Java
> --
>
> Key: SPARK-29493
> URL: https://issues.apache.org/jira/browse/SPARK-29493
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This will add MapType support for Arrow in Spark ArrowConverters. This can 
> happen after the Arrow 0.15.0 upgrade, but MapType is not available in 
> pyarrow yet, so pyspark and pandas_udf will come later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2019-11-12 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-25351:
--

reopening, this should be straightforward to add

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java

2019-11-12 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972833#comment-16972833
 ] 

Bryan Cutler commented on SPARK-29493:
--

[~jalpan.randeri] this depends on SPARK-29376 for a newer version of Arrow. I 
was planning on doing this after because it might be a little tricky. If you 
want to take a look at SPARK-25351, I think that would be more straightforward 
to add.

> Add MapType support for Arrow Java
> --
>
> Key: SPARK-29493
> URL: https://issues.apache.org/jira/browse/SPARK-29493
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This will add MapType support for Arrow in Spark ArrowConverters. This can 
> happen after the Arrow 0.15.0 upgrade, but MapType is not available in 
> pyarrow yet, so pyspark and pandas_udf will come later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23485) Kubernetes should support node blacklist

2019-11-12 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23485:
-
Affects Version/s: (was: 2.3.0)
   3.0.0
   2.4.0

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: bulk-closed
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (eg., because of bad hardware).  When running in yarn, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23485) Kubernetes should support node blacklist

2019-11-12 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23485:
-
Labels:   (was: bulk-closed)

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (eg., because of bad hardware).  When running in yarn, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23485) Kubernetes should support node blacklist

2019-11-12 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reopened SPARK-23485:
--

I think this issue is still valid for current versions, it was just opened 
against 2.3.0, so reopening.

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: bulk-closed
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (eg., because of bad hardware).  When running in yarn, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java

2019-11-12 Thread Jalpan Randeri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972898#comment-16972898
 ] 

Jalpan Randeri commented on SPARK-29493:


Sure, I will include the Python changes in my PR as well.

> Add MapType support for Arrow Java
> --
>
> Key: SPARK-29493
> URL: https://issues.apache.org/jira/browse/SPARK-29493
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This will add MapType support for Arrow in Spark ArrowConverters. This can 
> happen after the Arrow 0.15.0 upgrade, but MapType is not available in 
> pyarrow yet, so pyspark and pandas_udf will come later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train

2019-11-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29844.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26469
[https://github.com/apache/spark/pull/26469]

> Improper unpersist strategy in ml.recommendation.ASL.train
> --
>
> Key: SPARK-29844
> URL: https://issues.apache.org/jira/browse/SPARK-29844
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Assignee: Dong Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the 
> end of the method, these RDDs are unpersisted, but the timing of the 
> unpersist calls is not right, which causes recomputation and wastes memory.
> {code:scala}
> val userIdAndFactors = userInBlocks
>   .mapValues(_.srcIds)
>   .join(userFactors)
>   .mapPartitions({ items =>
> items.flatMap { case (_, (ids, factors)) =>
>   ids.view.zip(factors)
> }
>   // Preserve the partitioning because IDs are consistent with the 
> partitioners in userInBlocks
>   // and userFactors.
>   }, preservesPartitioning = true)
>   .setName("userFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> val itemIdAndFactors = itemInBlocks
>   .mapValues(_.srcIds)
>   .join(itemFactors)
>   .mapPartitions({ items =>
> items.flatMap { case (_, (ids, factors)) =>
>   ids.view.zip(factors)
> }
>   }, preservesPartitioning = true)
>   .setName("itemFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> if (finalRDDStorageLevel != StorageLevel.NONE) {
>   userIdAndFactors.count()
>   itemFactors.unpersist() // Premature unpersist
>   itemIdAndFactors.count()
>   userInBlocks.unpersist() // Lagging unpersist
>   userOutBlocks.unpersist() // Lagging unpersist
>   itemInBlocks.unpersist() 
>   itemOutBlocks.unpersist() // Lagging unpersist
>   blockRatings.unpersist() // Lagging unpersist
> }
> (userIdAndFactors, itemIdAndFactors)
>   }
> {code}
> 1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses 
> itemFactors, so itemFactors will be recomputed.
> 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted 
> too late. The final action, itemIdAndFactors.count(), does not use these RDDs, 
> so they can be unpersisted before it to save memory.
> By the way, itemIdAndFactors is persisted here but is never unpersisted until 
> the application ends. It may hurt performance, but I think it's hard to fix.
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
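
A sketch of the unpersist ordering the description argues for (names are taken 
from the quoted snippet above; this is only an illustration, not a drop-in patch 
for ALS.train):

{code:scala}
if (finalRDDStorageLevel != StorageLevel.NONE) {
  userIdAndFactors.count()
  // These RDDs are not needed by the remaining action, so release them early.
  userInBlocks.unpersist()
  userOutBlocks.unpersist()
  itemOutBlocks.unpersist()
  blockRatings.unpersist()
  itemIdAndFactors.count()
  // Only after itemIdAndFactors is materialized can its inputs be dropped,
  // otherwise the count above would recompute them.
  itemFactors.unpersist()
  itemInBlocks.unpersist()
}
{code}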



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train

2019-11-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29844:


Assignee: Dong Wang

> Improper unpersist strategy in ml.recommendation.ASL.train
> --
>
> Key: SPARK-29844
> URL: https://issues.apache.org/jira/browse/SPARK-29844
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Assignee: Dong Wang
>Priority: Minor
>
> In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the 
> end of the method, these RDDs are unpersisted, but the timing of the 
> unpersist calls is not right, which causes recomputation and wastes memory.
> {code:scala}
> val userIdAndFactors = userInBlocks
>   .mapValues(_.srcIds)
>   .join(userFactors)
>   .mapPartitions({ items =>
> items.flatMap { case (_, (ids, factors)) =>
>   ids.view.zip(factors)
> }
>   // Preserve the partitioning because IDs are consistent with the 
> partitioners in userInBlocks
>   // and userFactors.
>   }, preservesPartitioning = true)
>   .setName("userFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> val itemIdAndFactors = itemInBlocks
>   .mapValues(_.srcIds)
>   .join(itemFactors)
>   .mapPartitions({ items =>
> items.flatMap { case (_, (ids, factors)) =>
>   ids.view.zip(factors)
> }
>   }, preservesPartitioning = true)
>   .setName("itemFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> if (finalRDDStorageLevel != StorageLevel.NONE) {
>   userIdAndFactors.count()
>   itemFactors.unpersist() // Premature unpersist
>   itemIdAndFactors.count()
>   userInBlocks.unpersist() // Lagging unpersist
>   userOutBlocks.unpersist() // Lagging unpersist
>   itemInBlocks.unpersist() 
>   itemOutBlocks.unpersist() // Lagging unpersist
>   blockRatings.unpersist() // Lagging unpersist
> }
> (userIdAndFactors, itemIdAndFactors)
>   }
> {code}
> 1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses 
> itemFactors, so itemFactors will be recomputed.
> 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted 
> too late. The final action, itemIdAndFactors.count(), does not use these RDDs, 
> so they can be unpersisted before it to save memory.
> By the way, itemIdAndFactors is persisted here but is never unpersisted until 
> the application ends. It may hurt performance, but I think it's hard to fix.
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns

2019-11-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29570:


Assignee: Ankit Raj Boudh

> Improve tooltip for Executor Tab for Shuffle 
> Write,Blacklisted,Logs,Threaddump columns
> --
>
> Key: SPARK-29570
> URL: https://issues.apache.org/jira/browse/SPARK-29570
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Ankit Raj Boudh
>Priority: Minor
>
> When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and 
> Thread Dump columns in the Executors tab, the tooltip is not displayed at the 
> center, whereas for the other columns it is.
> Please fix this issue on all Spark Web UI pages and on the History UI page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns

2019-11-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29570.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26263
[https://github.com/apache/spark/pull/26263]

> Improve tooltip for Executor Tab for Shuffle 
> Write,Blacklisted,Logs,Threaddump columns
> --
>
> Key: SPARK-29570
> URL: https://issues.apache.org/jira/browse/SPARK-29570
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Ankit Raj Boudh
>Priority: Minor
> Fix For: 3.0.0
>
>
> When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and 
> Thread Dump columns in the Executors tab, the tooltip is not displayed at the 
> center, whereas for the other columns it is.
> Please fix this issue on all Spark Web UI pages and on the History UI page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29399:


Assignee: Marcelo Masiero Vanzin

> Remove old ExecutorPlugin API (or wrap it using new API)
> 
>
> Key: SPARK-29399
> URL: https://issues.apache.org/jira/browse/SPARK-29399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
>
> The parent bug is proposing a new plugin API for Spark, so we could remove 
> the old one since it's a developer API.
> If we can get the new API into Spark 3.0, then removal might be a better idea 
> than deprecation. That would be my preference since then we can remove the 
> new elements added as part of SPARK-28091 without having to deprecate them 
> first.
> If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 
> and manage old plugins through the new code being added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29396) Extend Spark plugin interface to driver

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29396.
--
Fix Version/s: 3.0.0
 Assignee: Marcelo Masiero Vanzin
   Resolution: Done

> Extend Spark plugin interface to driver
> ---
>
> Key: SPARK-29396
> URL: https://issues.apache.org/jira/browse/SPARK-29396
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Spark provides an extension API for people to implement executor plugins, 
> added in SPARK-24918 and later extended in SPARK-28091.
> That API does not offer any functionality for doing similar things on the 
> driver side, though. As a consequence of that, there is not a good way for 
> the executor plugins to get information or communicate in any way with the 
> Spark driver.
> I've been playing with such an improved API for developing some new 
> functionality. I'll file a few child bugs for the work to get the changes in.
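
To make the shape of the proposal concrete, here is a conceptual sketch only 
(this is NOT the actual interface -- that is defined by the child tickets -- it 
just illustrates one plugin contributing a driver-side and an executor-side 
component, with a way for executors to reach the driver):

{code:scala}
// Illustrative trait shapes, not Spark's real plugin API.
trait DriverSidePlugin {
  def init(conf: Map[String, String]): Unit
  def receive(message: Any): Any   // handles messages sent from executor plugins
  def shutdown(): Unit
}

trait ExecutorSidePlugin {
  def init(sendToDriver: Any => Any): Unit  // forwards a message to the driver side
  def shutdown(): Unit
}

trait UnifiedSparkPlugin {
  def driverPlugin(): DriverSidePlugin
  def executorPlugin(): ExecutorSidePlugin
}
{code}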



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29399.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26390
[https://github.com/apache/spark/pull/26390]

> Remove old ExecutorPlugin API (or wrap it using new API)
> 
>
> Key: SPARK-29399
> URL: https://issues.apache.org/jira/browse/SPARK-29399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> The parent bug is proposing a new plugin API for Spark, so we could remove 
> the old one since it's a developer API.
> If we can get the new API into Spark 3.0, then removal might be a better idea 
> than deprecation. That would be my preference since then we can remove the 
> new elements added as part of SPARK-28091 without having to deprecate them 
> first.
> If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 
> and manage old plugins through the new code being added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29868) Add a benchmark for Adaptive Execution

2019-11-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29868:
---

 Summary: Add a benchmark for Adaptive Execution
 Key: SPARK-29868
 URL: https://issues.apache.org/jira/browse/SPARK-29868
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang


Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
BroadcastJoin performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29868) Add a benchmark for Adaptive Execution

2019-11-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29868:

Attachment: BroadcastJoin.jpeg
SortMergeJoin.jpeg

> Add a benchmark for Adaptive Execution
> --
>
> Key: SPARK-29868
> URL: https://issues.apache.org/jira/browse/SPARK-29868
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg
>
>
> Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
> BroadcastJoin performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29868) Add a benchmark for Adaptive Execution

2019-11-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29868:

Description: 
Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
BroadcastJoin performance.

It seems SortMergeJoin is faster than BroadcastJoin when one side is a bucketed 
table and the join can be converted to a BroadcastJoin.

  was:
Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to 

BroadcastJoin performance.


> Add a benchmark for Adaptive Execution
> --
>
> Key: SPARK-29868
> URL: https://issues.apache.org/jira/browse/SPARK-29868
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg
>
>
> Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
> BroadcastJoin performance.
> It seems SortMergeJoin is faster than BroadcastJoin when one side is a bucketed 
> table and the join can be converted to a BroadcastJoin.
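
A rough sketch of the kind of comparison such a benchmark could run (a 
standalone illustration, not the benchmark class this ticket adds; the table 
names are made up):

{code:scala}
import org.apache.spark.sql.SparkSession

object AqeJoinBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aqe-join-sketch").getOrCreate()

    def time(label: String)(f: => Unit): Unit = {
      val start = System.nanoTime()
      f
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
    }

    val query =
      "SELECT count(*) FROM big_table t1 JOIN dim_table t2 ON t1.k = t2.k"

    // Disable broadcast joins so the static plan uses SortMergeJoin.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
    time("SortMergeJoin")(spark.sql(query).collect())

    // Enable AQE and a broadcast threshold so the join can become a BroadcastJoin.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
    time("BroadcastJoin")(spark.sql(query).collect())
  }
}
{code}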



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29868) Add a benchmark for Adaptive Execution

2019-11-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972967#comment-16972967
 ] 

Yuming Wang commented on SPARK-29868:
-

Examples: 
[https://github.com/apache/spark/tree/cdea520ff8954cf415fd98d034d9b674d6ca4f67/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark]

> Add a benchmark for Adaptive Execution
> --
>
> Key: SPARK-29868
> URL: https://issues.apache.org/jira/browse/SPARK-29868
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg
>
>
> Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
> BroadcastJoin performance.
> It seems SortMergeJoin is faster than BroadcastJoin when one side is a bucketed 
> table and the join can be converted to a BroadcastJoin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError

2019-11-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29869:
---

 Summary: HiveMetastoreCatalog#convertToLogicalRelation throws 
AssertionError
 Key: SPARK-29869
 URL: https://issues.apache.org/jira/browse/SPARK-29869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


{noformat}
scala> spark.table("table").show
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
  at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
  at 
org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
  at 
org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at 
scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
  at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
  at 
org.apache.spark.sql.ca

[jira] [Updated] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError

2019-11-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29869:

Attachment: SPARK-29869.jpg

> HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError
> ---
>
> Key: SPARK-29869
> URL: https://issues.apache.org/jira/browse/SPARK-29869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: SPARK-29869.jpg
>
>
> {noformat}
> scala> spark.table("table").show
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:208)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
>   at 
> org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
>   at 
> org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
>   at 
> org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAn

[jira] [Commented] (SPARK-29868) Add a benchmark for Adaptive Execution

2019-11-12 Thread leonard ding (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972974#comment-16972974
 ] 

leonard ding commented on SPARK-29868:
--

I will pick this up.

> Add a benchmark for Adaptive Execution
> --
>
> Key: SPARK-29868
> URL: https://issues.apache.org/jira/browse/SPARK-29868
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg
>
>
> Add a benchmark for Adaptive Execution to compare SortMergeJoin and 
> BroadcastJoin performance.
> It seems SortMergeJoin is faster than BroadcastJoin when one side is a bucketed 
> table and the join can be converted to a BroadcastJoin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError

2019-11-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29869:

Description: 
{noformat}
scala> spark.table("hive_table").show
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
  at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
  at 
org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
  at 
org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at 
scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
  at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
  at 
org.apache.spark

[jira] [Commented] (SPARK-29855) typed literals with negative sign with proper result or exception

2019-11-12 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972983#comment-16972983
 ] 

Kent Yao commented on SPARK-29855:
--

This is the related ticket that involves the negative sign before typed literals.

> typed literals with negative sign with proper result or exception
> -
>
> Key: SPARK-29855
> URL: https://issues.apache.org/jira/browse/SPARK-29855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> -- !query 83
> select -integer '7'
> -- !query 83 schema
> struct<7:int>
> -- !query 83 output
> 7
> -- !query 86
> select -date '1999-01-01'
> -- !query 86 schema
> struct
> -- !query 86 output
> 1999-01-01
> -- !query 87
> select -timestamp '1999-01-01'
> -- !query 87 schema
> struct
> -- !query 87 output
> 1999-01-01 00:00:00
> {code}
> The integer result should be -7, and the date and timestamp results are 
> confusing; those cases should throw an exception instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval

2019-11-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-29870:


 Summary: Unify the logic of multi-units interval string to 
CalendarInterval
 Key: SPARK-29870
 URL: https://issues.apache.org/jira/browse/SPARK-29870
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


We now have two different implementations for converting multi-unit interval 
strings to CalendarInterval values.

One is used to convert interval string literals to CalendarInterval. This 
approach re-delegates the interval string to the Spark parser, which handles the 
string as a `singleInterval` -> `multiUnitsInterval` and eventually calls 
`IntervalUtils.fromUnitStrings`.

The other is used in `Cast`, which eventually calls 
`IntervalUtils.stringToInterval`. This approach is ~10 times faster than the 
other.

We should unify these two for better performance and simpler logic.
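
The two user-visible entry points, for reference (plain Spark SQL, assuming a 
SparkSession named {{spark}}, e.g. in spark-shell; which internal method each 
one reaches is as described above):

{code:scala}
// 1. An interval literal goes through the parser (multiUnitsInterval ->
//    IntervalUtils.fromUnitStrings).
spark.sql("SELECT INTERVAL '1 year 2 months 3 days'").show()

// 2. A cast from string goes through Cast (IntervalUtils.stringToInterval).
spark.sql("SELECT CAST('1 year 2 months 3 days' AS INTERVAL)").show()
{code}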



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval

2019-11-12 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972992#comment-16972992
 ] 

Kent Yao commented on SPARK-29870:
--

working on this

> Unify the logic of multi-units interval string to CalendarInterval
> --
>
> Key: SPARK-29870
> URL: https://issues.apache.org/jira/browse/SPARK-29870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> We now have two different implementations for converting multi-unit interval 
> strings to CalendarInterval values.
> One is used to convert interval string literals to CalendarInterval. This 
> approach re-delegates the interval string to the Spark parser, which handles 
> the string as a `singleInterval` -> `multiUnitsInterval` and eventually calls 
> `IntervalUtils.fromUnitStrings`.
> The other is used in `Cast`, which eventually calls 
> `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the 
> other.
> We should unify these two for better performance and simpler logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29462) The data type of "array()" should be array

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29462:
-
Fix Version/s: (was: 3.0.0)

> The data type of "array()" should be array
> 
>
> Key: SPARK-29462
> URL: https://issues.apache.org/jira/browse/SPARK-29462
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> In the current implementation:
> > spark.sql("select array()")
> res0: org.apache.spark.sql.DataFrame = [array(): array]
> The output type should be array



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29462) The data type of "array()" should be array

2019-11-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973000#comment-16973000
 ] 

Hyukjin Kwon commented on SPARK-29462:
--

Reverted due to the reasons described in 
https://github.com/apache/spark/pull/26324#issuecomment-553230428

> The data type of "array()" should be array
> 
>
> Key: SPARK-29462
> URL: https://issues.apache.org/jira/browse/SPARK-29462
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> In the current implementation:
> > spark.sql("select array()")
> res0: org.apache.spark.sql.DataFrame = [array(): array]
> The output type should be array



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29462) The data type of "array()" should be array

2019-11-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-29462:
--
  Assignee: (was: Aman Omer)

> The data type of "array()" should be array
> 
>
> Key: SPARK-29462
> URL: https://issues.apache.org/jira/browse/SPARK-29462
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the current implementation:
> > spark.sql("select array()")
> res0: org.apache.spark.sql.DataFrame = [array(): array]
> The output type should be array



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2019-11-12 Thread ABHISHEK KUMAR GUPTA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973007#comment-16973007
 ] 

ABHISHEK KUMAR GUPTA commented on SPARK-29854:
--

[~hyukjin.kwon]: Hi, I checked in Hive and PostgreSQL: they give an exception.
In MySQL it gives NULL:
mysql> SELECT lpad('hihhh', 5000, '');
+-------------------------+
| lpad('hihhh', 5000, '') |
+-------------------------+
| NULL                    |
+-------------------------+
1 row in set, 3 warnings (0.00 sec)

mysql> SELECT rpad('hihhh', 5000, '');
+-------------------------+
| rpad('hihhh', 5000, '') |
+-------------------------+
| NULL                    |
+-------------------------+
1 row in set, 3 warnings (0.00 sec)

> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark returns an empty string:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> Hive:
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL
> function lpad(unknown, numeric, unknown) does not exist
>  
> Expected output:
> Spark should also throw an exception, like Hive.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2019-11-12 Thread Jalpan Randeri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973023#comment-16973023
 ] 

Jalpan Randeri commented on SPARK-25351:


Yes, I will take up this changes as part of my PR

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29390) Add the justify_days(), justify_hours() and justify_interval() functions

2019-11-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29390.
--
Fix Version/s: 3.0.0
 Assignee: Kent Yao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/26465

> Add  the justify_days(),  justify_hours() and  justify_interval() functions
> ---
>
> Key: SPARK-29390
> URL: https://issues.apache.org/jira/browse/SPARK-29390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> See *Table 9.31. Date/Time Functions* 
> ([https://www.postgresql.org/docs/12/functions-datetime.html)]
> |{{justify_days(}}{{interval}}{{)}}|{{interval}}|Adjust interval so 30-day 
> time periods are represented as months|{{justify_days(interval '35 
> days')}}|{{1 mon 5 days}}|
> | {{justify_hours(}}{{interval}}{{)}}|{{interval}}|Adjust interval so 24-hour 
> time periods are represented as days|{{justify_hours(interval '27 
> hours')}}|{{1 day 03:00:00}}|
> | {{justify_interval(}}{{interval}}{{)}}|{{interval}}|Adjust interval using 
> {{justify_days}} and {{justify_hours}}, with additional sign 
> adjustments|{{justify_interval(interval '1 mon -1 hour')}}|{{29 days 
> 23:00:00}}|
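
Expected usage on the Spark side, assuming the functions added by the linked PR 
follow the PostgreSQL semantics quoted above (outputs shown as approximate 
interval strings):

{code:scala}
spark.sql("SELECT justify_days(interval '35 days')").show()        // ~ 1 months 5 days
spark.sql("SELECT justify_hours(interval '27 hours')").show()      // ~ 1 days 3 hours
spark.sql("SELECT justify_interval(interval '1 month -1 hour')").show()
{code}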



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29871) Flaky test: ImageFileFormatTest.test_read_images

2019-11-12 Thread wuyi (Jira)
wuyi created SPARK-29871:


 Summary: Flaky test: ImageFileFormatTest.test_read_images
 Key: SPARK-29871
 URL: https://issues.apache.org/jira/browse/SPARK-29871
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 3.0.0
Reporter: wuyi


Running tests...
--
 test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR 
(12.050s)

==
ERROR [12.050s]: test_read_images 
(pyspark.ml.tests.test_image.ImageFileFormatTest)
--
Traceback (most recent call last):
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py",
 line 35, in test_read_images
 self.assertEqual(df.count(), 4)
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py",
 line 507, in count
 return int(self._jdf.count())
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1286, in __call__
 answer, self.gateway_client, self.target_id, self.name)
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", 
line 98, in deco
 return f(*a, **kw)
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
 line 328, in get_return_value
 format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 
1, amp-jenkins-worker-05.amp, executor driver): javax.imageio.IIOException: 
Unsupported Image Type
 at 
com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
 at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
 at javax.imageio.ImageIO.read(ImageIO.java:1448)
 at javax.imageio.ImageIO.read(ImageIO.java:1352)
 at org.apache.spark.ml.image.ImageSchema$.decode(ImageSchema.scala:134)
 at 
org.apache.spark.ml.source.image.ImageFileFormat.$anonfun$buildReader$2(ImageFileFormat.scala:84)
 at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
 at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:33)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:63)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
 at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
 at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1979)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1967)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1966)
 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
 at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGS

[jira] [Updated] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError

2019-11-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29869:

Description: 
{noformat}
scala> spark.table("hive_table").show
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
  at org.apache.spark

[jira] [Created] (SPARK-29872) Improper cache strategy in examples

2019-11-12 Thread Dong Wang (Jira)
Dong Wang created SPARK-29872:
-

 Summary: Improper cache strategy in examples
 Key: SPARK-29872
 URL: https://issues.apache.org/jira/browse/SPARK-29872
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 3.0.0
Reporter: Dong Wang


1. Improper cache in examples.SparkTC
The RDD edges should be cached because it is used multiple times in the while loop, 
and it should be unpersisted before the last action tc.count(), because tc has 
already been persisted by then.
On the other hand, a new tc object is cached on every loop iteration but never 
uncached, which wastes memory. One possible fix is sketched after the snippet below.
{code:scala}
val edges = tc.map(x => (x._2, x._1)) // Edges should be cached
// This join is iterated until a fixed point is reached.
var oldCount = 0L
var nextCount = tc.count()
do { 
  oldCount = nextCount
  // Perform the join, obtaining an RDD of (y, (z, x)) pairs,
  // then project the result to obtain the new (x, z) paths.
  tc = tc.union(tc.join(edges).map(x => (x._2._2, x._2._1))).distinct().cache()
  nextCount = tc.count()
} while (nextCount != oldCount)
println(s"TC has ${tc.count()} edges.")
{code}
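A minimal sketch of that fix, reusing the names from the snippet above and assuming tc is the already-cached RDD[(Int, Int)] from examples.SparkTC: cache edges once, drop each superseded tc after the new one has been materialized, and unpersist edges before the final count.

{code:scala}
// Sketch only: names mirror the original snippet; tc is assumed to be an
// already-cached RDD[(Int, Int)] as in examples.SparkTC.
val edges = tc.map(x => (x._2, x._1)).cache() // reused in every iteration

var oldCount = 0L
var nextCount = tc.count()
do {
  oldCount = nextCount
  val previousTc = tc
  tc = tc.union(tc.join(edges).map(x => (x._2._2, x._2._1))).distinct().cache()
  nextCount = tc.count()   // materializes the new tc
  previousTc.unpersist()   // the superseded tc is no longer needed
} while (nextCount != oldCount)

edges.unpersist()          // only tc is read by the final action
println(s"TC has ${tc.count()} edges.")
{code}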

2. Cache needed in examples.ml.LogisticRegressionSummary
The DataFrame fMeasure should be cached, since two separate actions read it. 
One possible fix is sketched after the snippet below.
{code:scala}
// Set the model threshold to maximize F-Measure
val fMeasure = trainingSummary.fMeasureByThreshold // fMeasure should be cached
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
  .select("threshold").head().getDouble(0)
lrModel.setThreshold(bestThreshold)
{code}
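A minimal sketch of that change, assuming the same scope as the original example (trainingSummary, lrModel, max and $ already in scope): fMeasure is persisted before the two actions that read it and released afterwards.

{code:scala}
// Sketch only: same names as the snippet above.
// Set the model threshold to maximize F-Measure
val fMeasure = trainingSummary.fMeasureByThreshold.cache() // read by the two actions below
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
  .select("threshold").head().getDouble(0)
fMeasure.unpersist() // both actions are done
lrModel.setThreshold(bestThreshold)
{code}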

3. Cache needed in examples.sql.SparkSQLExample

{code:scala}
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt)) // 
This RDD should be cahced
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 
13 AND 19")
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
teenagersDF.map(teenager => "Name: " + 
teenager.getAs[String]("name")).show()
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, 
Any]]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", 
"age"))).collect()
{code}
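One possible shape of the fix, sketched under the assumptions of the original example (the Person case class and spark.implicits._ in scope): cache the parsed RDD (or, equivalently, peopleDF itself) so the actions on teenagersDF do not re-read and re-parse people.txt.

{code:scala}
// Sketch only: splits the original chain so the parsed RDD can be cached.
val peopleRDD = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .cache()                        // reused by every action on the derived DataFrames
val peopleDF = peopleRDD.toDF()
peopleDF.createOrReplaceTempView("people")
// ... same queries and actions as in the snippet above ...
peopleRDD.unpersist()             // release the cached blocks once the actions have run
{code}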

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29861) Reduce downtime in Spark standalone HA master switch

2019-11-12 Thread Robin Wolters (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Wolters updated SPARK-29861:
--
Summary: Reduce downtime in Spark standalone HA master switch  (was: Reduce 
leader election downtime in Spark standalone HA)

> Reduce downtime in Spark standalone HA master switch
> 
>
> Key: SPARK-29861
> URL: https://issues.apache.org/jira/browse/SPARK-29861
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Robin Wolters
>Priority: Minor
>
> As officially stated in the Spark [HA 
> documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper],
>  the recovery process of a Spark (standalone) master in HA mode with ZooKeeper takes 
> about 1-2 minutes. During this time no Spark master is active, which makes 
> interaction with Spark essentially impossible. 
> After looking for a way to reduce this downtime, it appears to be caused mainly 
> by the leader election, which waits for open ZooKeeper connections to be closed. 
> This downtime seems unnecessary, for example in the case of a planned VM update.
> I have fixed this in my setup by:
>  # Closing open ZooKeeper connections during Spark shutdown
>  # Bumping the Curator version and implementing a custom error policy that 
> tolerates a ZooKeeper connection suspension (a sketch follows below).
> I am preparing a pull request for review / further discussion on this issue.
>  
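Not the actual patch from the upcoming pull request, just a minimal sketch of the second point, assuming Apache Curator 4.x where CuratorFrameworkFactory.Builder exposes connectionStateErrorPolicy(...): treat only a LOST connection as an error (so a SUSPENDED connection does not interrupt leadership), and close the client explicitly on shutdown so ZooKeeper removes the old master's ephemeral nodes right away instead of after the session timeout.

{code:scala}
// Sketch only: hypothetical wiring, not Spark's actual leader-election code.
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateErrorPolicy}
import org.apache.curator.retry.ExponentialBackoffRetry

// Tolerate a temporary connection suspension; only a lost session is fatal.
class SuspensionTolerantErrorPolicy extends ConnectionStateErrorPolicy {
  override def isErrorState(state: ConnectionState): Boolean =
    state == ConnectionState.LOST
}

val zk: CuratorFramework = CuratorFrameworkFactory.builder()
  .connectString("zk1:2181,zk2:2181,zk3:2181")            // placeholder quorum
  .retryPolicy(new ExponentialBackoffRetry(1000, 3))
  .connectionStateErrorPolicy(new SuspensionTolerantErrorPolicy)
  .build()
zk.start()

// Closing the client on shutdown lets ZooKeeper delete the ephemeral leader
// nodes immediately, instead of waiting for the session timeout to expire.
sys.addShutdownHook {
  zk.close()
}
{code}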



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org