[GitHub] spark issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd wit...

2018-12-10 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23262
  
retest this please


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd wit...

2018-12-10 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23262
  
@cloud-fan Updated, thanks.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRow...

2018-12-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240188043
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -416,7 +416,12 @@ case class DataSourceStrategy(conf: SQLConf) extends Strategy with Logging with
      output: Seq[Attribute],
      rdd: RDD[Row]): RDD[InternalRow] = {
    if (relation.relation.needConversion) {
-      execution.RDDConversions.rowToRowRdd(rdd, output.map(_.dataType))
+      val converters = RowEncoder(StructType.fromAttributes(output))
+      rdd.mapPartitions { iterator =>
+        iterator.map { r =>
--- End diff --

Modified, thanks.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23262
  
@HyukjinKwon Ok, removed it, thanks for review.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23262
  
@HyukjinKwon @mgaido91 Thanks for the review. @cloud-fan @kiszk Would you like to give some suggestions: should we remove the object `RDDConversions`, or leave it there?


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240114106
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -53,7 +53,7 @@ object RDDConversions {
    data.mapPartitions { iterator =>
      val numColumns = outputTypes.length
      val mutableRow = new GenericInternalRow(numColumns)
-      val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+      val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
--- End diff --

It has been modified, and the performance is the same as converting to 
arrays.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240113822
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -33,7 +33,7 @@ object RDDConversions {
    data.mapPartitions { iterator =>
      val numColumns = outputTypes.length
      val mutableRow = new GenericInternalRow(numColumns)
-      val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+      val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
--- End diff --

That is a good suggestion, and it has been modified. Would you like to review it again? Thanks.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-08 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23262
  
retest this please


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-07 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/23262

[SPARK-26312][SQL]Converting converters in RDDConversions into arrays to 
improve their access performance


## What changes were proposed in this pull request?

`RDDConversions` gets disproportionately slower as the number of columns in the query increases.
This PR converts the `converters` in `RDDConversions` into arrays to improve their access performance; the type of `converters` before this change is `scala.collection.immutable.::`, which is a subtype of List.

The test of `PrunedScanSuite` for 2000 columns and 20k rows takes 409 
seconds before this PR, and 361 seconds after.
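For illustration, a minimal self-contained sketch of why the array conversion helps (the names outputTypes and convert here are stand-ins, not the real Spark API): indexing into a List is O(i) per element, while indexing into an Array is O(1), which matters inside the per-row, per-column loop.

    // Hedged sketch: convert stands in for CatalystTypeConverters.createToCatalystConverter.
    val outputTypes: Seq[String] = Seq("int", "long", "string")
    def convert(dataType: String): Any => Any = identity

    // Before: map on a List-backed Seq gives linear-time indexed access in the row loop.
    val listConverters: Seq[Any => Any] = outputTypes.map(convert)
    // After: .toArray gives constant-time access for converters(i).
    val arrayConverters: Array[Any => Any] = outputTypes.map(convert).toArray
    val converted = arrayConverters(1)(42L)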

## How was this patch tested?

Test case of `PrunedScanSuite` 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark toarray

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23262


commit ddb252892a439281b16bc14fdfdb7faf756f1067
Author: 10129659 
Date:   2018-12-08T07:15:10Z

Converting converters in RDDConversions into arrays to improve their access 
performance




---




[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...

2018-12-03 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23010
  
But we may forget to filter out null values when we write the SQL. The following function protects against this situation and writes the value of a null partition as __HIVE_DEFAULT_PARTITION__:

    def getPartitionPathString(col: String, value: String): String = {
      val partitionString = if (value == null || value.isEmpty) {
        DEFAULT_PARTITION_NAME
      } else {
        escapePathName(value)
      }
      escapePathName(col) + "=" + partitionString
    }
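As a quick illustration of that behavior, a self-contained sketch (DEFAULT_PARTITION_NAME and escapePathName are stubbed here; they are only stand-ins for the real Spark helpers):

    // Hedged sketch: stubs for the real constant and escaping helper.
    val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"
    def escapePathName(path: String): String = path  // the real helper escapes special characters

    def getPartitionPathString(col: String, value: String): String = {
      val partitionString =
        if (value == null || value.isEmpty) DEFAULT_PARTITION_NAME else escapePathName(value)
      escapePathName(col) + "=" + partitionString
    }

    getPartitionPathString("p", null)  // "p=__HIVE_DEFAULT_PARTITION__"
    getPartitionPathString("p", "")    // "p=__HIVE_DEFAULT_PARTITION__"
    getPartitionPathString("p", "x")   // "p=x"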


---




[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...

2018-12-03 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23010
  
@cloud-fan Thanks for the review. Do you mean we should filter out invalid partitions in the SQL before writing?


---




[GitHub] spark pull request #22561: [SPARK-25548][SQL]In the PruneFileSourcePartition...

2018-12-02 Thread eatoncys
Github user eatoncys closed the pull request at:

https://github.com/apache/spark/pull/22561


---




[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...

2018-12-02 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/23010
  
retest this please


---




[GitHub] spark pull request #23010: [SPARK-26012][SQL]Null and '' values should not c...

2018-11-11 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/23010

[SPARK-26012][SQL]Null and '' values should not cause dynamic partition 
failure of string types

## What changes were proposed in this pull request?

Dynamic partitioning will fail when both '' and null values are used as dynamic partition values at the same time.
For example, the test below fails before this PR:

  test("Null and '' values should not cause dynamic partition failure of 
string types") {
withTable("t1", "t2") {
  spark.range(3).write.saveAsTable("t1")
  spark.sql("select id, cast(case when id = 1 then '' else null end as 
string) as p" +
" from t1").write.partitionBy("p").saveAsTable("t2")
  checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, 
null), Row(2, null)))
}
  }

The error is 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'.
This PR adds protection against file conflicts by renaming the file when a conflict occurs.



## How was this patch tested?
Newly added test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark dynamicPartition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23010.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23010


commit 1f18e2786a26eb64c52925d8ecff2d6a2295ca16
Author: 10129659 
Date:   2018-11-12T04:41:53Z

Null and '' values should not cause dynamic partition failure of string 
types




---




[GitHub] spark issue #22561: [SPARK-25548][SQL]In the PruneFileSourcePartitions optim...

2018-10-15 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22561
  
@cloud-fan Yes, it has problems with the Not expression; we need to find a better approach. Thanks for the review.


---




[GitHub] spark issue #22561: [SPARK-25548][SQL]In the PruneFileSourcePartitions optim...

2018-10-15 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22561
  
@cloud-fan What should be proved is that the partitions returned by p' contain the partitions returned by p. Here, let p' = p && x: if x is true then p' == p; if x is false, then no partitions are needed at all, so p' still contains the partitions we need.


---




[GitHub] spark pull request #22561: [SPARK-25548][SQL]In the PruneFileSourcePartition...

2018-10-15 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/22561#discussion_r225054113
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
 ---
@@ -39,21 +40,31 @@ private[sql] object PruneFileSourcePartitions extends 
Rule[LogicalPlan] {
 _,
 _))
 if filters.nonEmpty && fsRelation.partitionSchemaOption.isDefined 
=>
+
+  val sparkSession = fsRelation.sparkSession
+  val partitionColumns =
+logicalRelation.resolve(
+  partitionSchema, sparkSession.sessionState.analyzer.resolver)
+  val partitionSet = AttributeSet(partitionColumns)
   // The attribute name of predicate could be different than the one 
in schema in case of
   // case insensitive, we should change them to match the one in 
schema, so we donot need to
   // worry about case sensitivity anymore.
   val normalizedFilters = filters.map { e =>
-e transform {
+e transformUp {
   case a: AttributeReference =>
 
a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
+  // Replace the nonPartitionOps field with true in the 
And(partitionOps, nonPartitionOps)
+  // to make the partition can be pruned
+  case and @And(left, right) =>
+val leftPartition = 
left.references.filter(partitionSet.contains(_))
+val rightPartition = 
right.references.filter(partitionSet.contains(_))
+if (leftPartition.size == left.references.size && 
rightPartition.size == 0) {
--- End diff --

OK, I will add a UT later, thanks.


---




[GitHub] spark pull request #22561: [SPARK-25548][SQL]In the PruneFileSourcePartition...

2018-10-15 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/22561#discussion_r225053437
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
 ---
@@ -39,21 +40,31 @@ private[sql] object PruneFileSourcePartitions extends 
Rule[LogicalPlan] {
 _,
 _))
 if filters.nonEmpty && fsRelation.partitionSchemaOption.isDefined 
=>
+
+  val sparkSession = fsRelation.sparkSession
+  val partitionColumns =
+logicalRelation.resolve(
+  partitionSchema, sparkSession.sessionState.analyzer.resolver)
+  val partitionSet = AttributeSet(partitionColumns)
   // The attribute name of predicate could be different than the one 
in schema in case of
   // case insensitive, we should change them to match the one in 
schema, so we donot need to
   // worry about case sensitivity anymore.
   val normalizedFilters = filters.map { e =>
-e transform {
+e transformUp {
   case a: AttributeReference =>
 
a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
+  // Replace the nonPartitionOps field with true in the 
And(partitionOps, nonPartitionOps)
+  // to make the partition can be pruned
+  case and @And(left, right) =>
+val leftPartition = 
left.references.filter(partitionSet.contains(_))
+val rightPartition = 
right.references.filter(partitionSet.contains(_))
+if (leftPartition.size == left.references.size && 
rightPartition.size == 0) {
+  and.withNewChildren(Seq(left, Literal(true, BooleanType)))
+} else if (leftPartition.size == 0 && rightPartition.size == 
right.references.size) {
+  and.withNewChildren(Seq(Literal(true, BooleanType), right))
+} else and
 }
   }
-
-  val sparkSession = fsRelation.sparkSession
-  val partitionColumns =
-logicalRelation.resolve(
-  partitionSchema, sparkSession.sessionState.analyzer.resolver)
-  val partitionSet = AttributeSet(partitionColumns)
   val partitionKeyFilters =
--- End diff --

@cloud-fan Sorry, I don't understand very clearly. The splitConjunctivePredicates function can only split and(a, b); if there is an Or expression in the filter, for example 'where (p_d=2 and key=2) or (p_d=3 and key=3)', the result of splitConjunctivePredicates is '(((p_d#2 = 2) && (key#0 = 2)) || ((p_d#2 = 3) && (key#0 = 3)))', and the partition expression cannot be split out.
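To make the point concrete, a minimal standalone sketch (plain case classes, not Spark's expression tree) of why conjunct splitting stops at an Or:

    // Only top-level Ands are unfolded; an Or keeps both of its children together,
    // so the partition-only predicates inside it cannot be split out.
    sealed trait Expr
    case class And(left: Expr, right: Expr) extends Expr
    case class Or(left: Expr, right: Expr) extends Expr
    case class Pred(text: String) extends Expr

    def splitConjuncts(e: Expr): Seq[Expr] = e match {
      case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
      case other     => Seq(other)
    }

    val filter = Or(And(Pred("p_d = 2"), Pred("key = 2")),
                    And(Pred("p_d = 3"), Pred("key = 3")))
    splitConjuncts(filter)  // Seq(filter): the whole Or comes back as a single conjunct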


---




[GitHub] spark pull request #22561: [SPARK-25548][SQL]In the PruneFileSourcePartition...

2018-10-15 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/22561#discussion_r225050369
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala
 ---
@@ -39,21 +40,31 @@ private[sql] object PruneFileSourcePartitions extends 
Rule[LogicalPlan] {
 _,
 _))
 if filters.nonEmpty && fsRelation.partitionSchemaOption.isDefined 
=>
+
+  val sparkSession = fsRelation.sparkSession
+  val partitionColumns =
+logicalRelation.resolve(
+  partitionSchema, sparkSession.sessionState.analyzer.resolver)
+  val partitionSet = AttributeSet(partitionColumns)
   // The attribute name of predicate could be different than the one 
in schema in case of
   // case insensitive, we should change them to match the one in 
schema, so we donot need to
   // worry about case sensitivity anymore.
   val normalizedFilters = filters.map { e =>
-e transform {
+e transformUp {
   case a: AttributeReference =>
 
a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
+  // Replace the nonPartitionOps field with true in the 
And(partitionOps, nonPartitionOps)
+  // to make the partition can be pruned
+  case and @And(left, right) =>
+val leftPartition = 
left.references.filter(partitionSet.contains(_))
+val rightPartition = 
right.references.filter(partitionSet.contains(_))
+if (leftPartition.size == left.references.size && 
rightPartition.size == 0) {
--- End diff --

Ok, thanks for review.


---




[GitHub] spark issue #22561: [SPARK-25548][SQL]In the PruneFileSourcePartitions optim...

2018-09-27 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22561
  
retest this please


---




[GitHub] spark issue #22561: [SPARK-25548][SQL]In the PruneFileSourcePartitions optim...

2018-09-26 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22561
  
cc @gatorsmile @cloud-fan 


---




[GitHub] spark pull request #22561: [SPARK-25548][SQL]In the PruneFileSourcePartition...

2018-09-26 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/22561

[SPARK-25548][SQL]In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that partitions can be pruned

## What changes were proposed in this pull request?
In the PruneFileSourcePartitions optimizer, partition files are not pruned if we use a partition filter and a non-partition filter together, for example:

sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) 
partitioned by(p_d int) stored as parquet ")
sql("insert overwrite table src_par partition(p_d=2) select 2 as key, 
'4' as value")
sql("insert overwrite table src_par partition(p_d=3) select 3 as key, 
'4' as value")
sql("insert overwrite table src_par partition(p_d=4) select 4 as key, 
'4' as value")

Before this PR, the SQL below scans all the partition files, although the partition **p_d=4** should be pruned:
**sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and key=3)").show**
After this PR, the partition **p_d=4** is pruned.
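A small self-contained sketch (illustrative only, not the optimizer code) of why replacing the non-partition side of each And with a literal true is safe for pruning: the relaxed, partition-only predicate can only keep or enlarge the set of matching partitions.

    // Partitions of the example table above and the partition-only view of the filter:
    // (p_d=2 and true) or (p_d=3 and true).
    case class Partition(p_d: Int)
    val partitions = Seq(Partition(2), Partition(3), Partition(4))
    val relaxed: Partition => Boolean = part => part.p_d == 2 || part.p_d == 3

    // Whatever rows the full filter keeps can only live in partitions the relaxed
    // predicate keeps, so p_d=4 can be pruned safely.
    partitions.filter(relaxed)  // Seq(Partition(2), Partition(3))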

## How was this patch tested?
Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark partitionFilter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22561.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22561


commit 6acb460381c96fe71f807f94bb617f3928f41694
Author: 10129659 
Date:   2018-09-27T01:04:20Z

In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps 
field with true in the And(partitionOps, nonPartitionOps) to make the partition 
can be pruned




---




[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...

2018-08-09 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22053
  
@cloud-fan Unaligned accesses are not supported on the SPARC architecture, which is discussed in the issue: https://issues.apache.org/jira/browse/SPARK-16962.


---




[GitHub] spark issue #22053: [SPARK-25069][CORE]Using UnsafeAlignedOffset to make the...

2018-08-09 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/22053
  
@kiszk The comments are updated. Thanks for the review.


---




[GitHub] spark pull request #22053: [SPARK-25069][Core]Using UnsafeAlignedOffset to m...

2018-08-09 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/22053

[SPARK-25069][Core]Using UnsafeAlignedOffset to make the entire record of 8-byte items aligned, as is done in UnsafeExternalSorter

## What changes were proposed in this pull request?

The UnsafeExternalSorter class uses UnsafeAlignedOffset to keep the entire record of 8-byte items aligned, but ShuffleExternalSorter does not.
The SPARC platform requires this because using a 4-byte Int for record lengths causes the record of 8-byte items to become misaligned by 4 bytes; using an 8-byte long for the record length keeps everything 8-byte aligned.
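A back-of-envelope sketch (plain arithmetic, not the sorter code) of the alignment issue described above:

    // With a 4-byte length prefix, the 8-byte items that follow start at offset 4,
    // which is not 8-byte aligned; padding the prefix to 8 bytes restores alignment.
    val recordBase = 0L
    val payloadAfterIntPrefix  = recordBase + 4  // 4 % 8 != 0 -> misaligned on SPARC
    val payloadAfterLongPrefix = recordBase + 8  // 8 % 8 == 0 -> aligned
    assert(payloadAfterIntPrefix % 8 != 0 && payloadAfterLongPrefix % 8 == 0)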

## How was this patch tested?
Existing Test.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark UnsafeAlignedOffset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22053


commit 8559c454127904ad2d25930d6b743238f0fff46d
Author: 10129659 
Date:   2018-08-09T08:53:59Z

Aligned Offset for ShuffleExternalSorter




---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-23 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/21823#discussion_r204597636
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
@@ -58,4 +61,16 @@ class SameResultSuite extends QueryTest with 
SharedSQLContext {
 val df4 = spark.range(10).agg(sumDistinct($"id"))
 
assert(df3.queryExecution.executedPlan.sameResult(df4.queryExecution.executedPlan))
   }
+
+  test("Canonicalized result is not case-insensitive") {
--- End diff --

Modified, thanks.



---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-23 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
Can we merge it to master? @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/21823#discussion_r204199119
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
@@ -58,4 +61,16 @@ class SameResultSuite extends QueryTest with 
SharedSQLContext {
 val df4 = spark.range(10).agg(sumDistinct($"id"))
 
assert(df3.queryExecution.executedPlan.sameResult(df4.queryExecution.executedPlan))
   }
+
+  test("Canonicalized result is not case-insensitive") {
+val a = AttributeReference("A", IntegerType)()
+val b = AttributeReference("B", IntegerType)()
+val planUppercase = Project(Seq(a, b), LocalRelation(a))
--- End diff --

OK, thanks.


---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/21823#discussion_r203990617
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
 ---
@@ -50,4 +52,30 @@ class CanonicalizeSuite extends SparkFunSuite {
 assert(range.where(arrays1).sameResult(range.where(arrays2)))
 assert(!range.where(arrays1).sameResult(range.where(arrays3)))
   }
+
+  test("Canonicalized result is not case-insensitive") {
--- End diff --

OK, modified, thanks.


---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/21823#discussion_r203972375
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala 
---
@@ -237,7 +239,7 @@ abstract class QueryPlan[PlanType <: 
QueryPlan[PlanType]] extends TreeNode[PlanT
 // Top level `AttributeReference` may also be used for output like 
`Alias`, we should
 // normalize the epxrId too.
 id += 1
-ar.withExprId(ExprId(id)).canonicalized
+
ar.withExprId(ExprId(id)).withName(ar.name.toLowerCase(Locale.ROOT)).canonicalized
--- End diff --

I think it is OK; the attribute name was erased here in Spark 2.0.2 as well.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-20 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
@cloud-fan Regarding "why not fix this in doCanonicalize?": I think it is better to fix it in doCanonicalize, but I'm not very sure.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-20 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
@cloud-fan Fixing this in dedupRight is OK, but maybe there are other operations like dedupRight that change the case of the name.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-20 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
@cloud-fan

    case j @ Join(left, right, _, _) if !j.duplicateResolved =>
      j.copy(right = dedupRight(left, right))

dedupRight generates a new logical plan for the right child, which gets the 'key' from the original table 'src', but the left child does not.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-20 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
@cloud-fan Casting 'Key' to lower case is done by the ResolveReferences rule:

![image](https://user-images.githubusercontent.com/26834091/42987332-7798ba3e-8c2b-11e8-9bed-d8be2ec7dad7.png)




---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-19 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
cc @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-19 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/21823

[SPARK-24870][SQL]Cache can't work normally if there are case letters in SQL

## What changes were proposed in this pull request?
Modified canonicalization so that it is no longer case-sensitive.
Before this PR, the cache cannot work normally if there are mixed-case letters in the SQL, for example:
 sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING 
hive")

sql("select key, sum(case when Key > 0 then 1 else 0 end) as 
positiveNum " +
  "from src group by key").cache().createOrReplaceTempView("src_cache")
sql(
  s"""select a.key
   from
   (select key from src_cache where positiveNum = 1)a
   left join
   (select key from src_cache )b
   on a.key=b.key
""").explain

The physical plan of the sql is:

![image](https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png)

The subquery "select key from src_cache where positiveNum = 1" on the left 
of join can use the cache data, but the subquery "select key from src_cache" on 
the right of join cannot use the cache data.
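The fix lower-cases the attribute name during canonicalization (see the QueryPlan.scala diff quoted earlier in this digest). A tiny sketch of the idea, using plain strings rather than Spark's AttributeReference:

    import java.util.Locale
    // Lower-case the name during canonicalization so that plans differing only in
    // the case of a column name canonicalize to the same result.
    def canonicalName(name: String): String = name.toLowerCase(Locale.ROOT)
    assert(canonicalName("Key") == canonicalName("key"))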

## How was this patch tested?

Newly added test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark canonicalized

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21823.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21823


commit 2b2a5a33ed58ce07fd2515eb01e80acbedeb8b2a
Author: 10129659 
Date:   2018-07-20T01:43:53Z

Cache can't work normally if there are case letters in SQL




---




[GitHub] spark pull request #19819: [SPARK-22606][Streaming]Add threadId to the Cache...

2018-07-17 Thread eatoncys
Github user eatoncys closed the pull request at:

https://github.com/apache/spark/pull/19819


---




[GitHub] spark pull request #21084: [SPARK-23998][Core]It may be better to add @trans...

2018-04-18 Thread eatoncys
Github user eatoncys closed the pull request at:

https://github.com/apache/spark/pull/21084


---




[GitHub] spark issue #21084: [SPARK-23998][Core]It may be better to add @transient to...

2018-04-18 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21084
  
@jerryshao @hvanhovell  Ok, I will close it, thanks.


---




[GitHub] spark issue #21084: [SPARK-23998][Core]It may be better to add @transient to...

2018-04-18 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21084
  
@jerryshao There is no actual issue without @transient, but I think it is better to keep this field consistent with the other fields and make it clear which fields do not need to be serialized.


---




[GitHub] spark issue #21084: [SPARK-23998][Core]It may be better to add @transient to...

2018-04-18 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21084
  
@jiangxb1987 It does not take significant time to serialize the taskMemoryManager, because the value is null on the driver side, but I think it is better to stay consistent with the other fields in the Task class that are only used on the executor side, like '@volatile @transient private var _reasonIfKilled: String = null'.
In my test, the serialized size was reduced from 8392 bytes to 8325 bytes.




---




[GitHub] spark issue #21084: [SPARK-23998][Core]It may be better to add @transient to...

2018-04-17 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21084
  
@hvanhovell The field 'taskMemoryManager' is only used on the executor side, so it does not need to be serialized when sending the task from the driver to the executor.


---




[GitHub] spark pull request #21084: [SPARK-23998][Core]It may be better to add @trans...

2018-04-17 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/21084

[SPARK-23998][Core]It may be better to add @transient to the field 'taskMemoryManager' in class Task, since it is only set and used on the executor side

Add @transient to the field 'taskMemoryManager' in class Task, since it is only set and used on the executor side; it is set before use in class Executor like this:

    task.setTaskMemoryManager(taskMemoryManager)

before task.run.
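A minimal sketch of the proposal (class and field names simplified; not the real Task class):

    // The executor-only field is marked @transient so serialization skips it when the
    // task is shipped from the driver; the executor sets it before calling run().
    class MyTask extends Serializable {
      @transient private var taskMemoryManager: AnyRef = _
      def setTaskMemoryManager(m: AnyRef): Unit = { taskMemoryManager = m }
      def run(): Unit = { require(taskMemoryManager != null, "set on the executor before run") }
    }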


## What changes were proposed in this pull request?


## How was this patch tested?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark transient

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21084.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21084


commit 283e1ac91996e375cb9d1775fef39ea29eb85325
Author: 10129659 <chen.yanshan@...>
Date:   2018-04-17T08:20:38Z

It may be better to add @transient to field 'taskMemoryManager' in class 
Task, for it is only be setted and used in executor




---




[GitHub] spark pull request #19819: [SPARK-22606][Streaming]Add threadId to the Cache...

2017-11-25 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/19819

[SPARK-22606][Streaming]Add threadId to the CachedKafkaConsumer key

## What changes were proposed in this pull request?
If the value of the param 'spark.streaming.concurrentJobs' is more than one, and the value of the param 'spark.executor.cores' is more than one, two or more tasks in one executor may use the same Kafka consumer at the same time, and then it will throw an exception: "KafkaConsumer is not safe for multi-threaded access".
For example:
spark.streaming.concurrentJobs=2
spark.executor.cores=2
spark.cores.max=2
If there is only one topic with one partition ('topic1', 0) to consume, there will be two jobs running at the same time, and they will use the same cacheKey ('groupid', 'topic1', 0) to get the CachedKafkaConsumer from the cache 'private var cache: ju.LinkedHashMap[CacheKey, CachedKafkaConsumer[_, _]]', so they will get the same CachedKafkaConsumer.

This PR adds the threadId to the CachedKafkaConsumer key to prevent two threads from using one consumer at the same time.
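A minimal sketch of the proposed key change (types simplified; not the real CachedKafkaConsumer code):

    // Including the thread id in the cache key means two tasks running concurrently
    // in one executor look up different consumers instead of sharing one.
    case class CacheKey(groupId: String, topic: String, partition: Int, threadId: Long)
    val key = CacheKey("groupid", "topic1", 0, Thread.currentThread().getId)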



## How was this patch tested?
Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark kafka

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19819


commit aa02d8904fcbaa91df47ac224d90345bd555a372
Author: 10129659 <chen.yans...@zte.com.cn>
Date:   2017-11-25T08:15:17Z

Add threadId to CachedKafkaConsumer key




---




[GitHub] spark pull request #19022: [Spark-21807][SQL]Override ++ operation in Expres...

2017-08-23 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/19022#discussion_r134905955
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSetSuite.scala
 ---
@@ -210,4 +210,13 @@ class ExpressionSetSuite extends SparkFunSuite {
 assert((initialSet - (aLower + 1)).size == 0)
 
   }
+
+  test("add multiple elements to set") {
+val initialSet = ExpressionSet(aUpper + 1 :: Nil)
+val setToAddWithSameExpression = ExpressionSet(aUpper + 1 :: aUpper + 
2 :: Nil)
+val setToAdd = ExpressionSet(aUpper + 2 :: aUpper + 3 :: Nil)
--- End diff --

Yes, I have modified the name to setToAddWithOutSameExpression.


---



[GitHub] spark pull request #19022: [Spark-21807][SQL]Override ++ operation in Expres...

2017-08-23 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/19022#discussion_r134905739
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
 ---
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import scala.collection.mutable
+import scala.collection.{GenTraversableOnce, mutable}
--- End diff --

Ok, modified it, thanks.


---



[GitHub] spark pull request #19022: [Spark-21807][SQL]The getAliasedConstraints funct...

2017-08-23 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/19022#discussion_r134703026
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
 ---
@@ -59,6 +59,12 @@ class ExpressionSet protected(
 }
   }
 
+  def addMultiExpressions(elems: Set[Expression]): ExpressionSet = {
--- End diff --

Ok, Added, thanks.


---



[GitHub] spark pull request #19022: [Spark-21807][SQL]The getAliasedConstraints funct...

2017-08-23 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/19022#discussion_r134695982
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
 ---
@@ -59,6 +59,12 @@ class ExpressionSet protected(
 }
   }
 
+  def addMultiExpressions(elems: Set[Expression]): ExpressionSet = {
--- End diff --

Ok, Modified it, thanks.


---



[GitHub] spark pull request #19022: [Spark-21807][SQL]The getAliasedConstraints funct...

2017-08-22 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/19022

[Spark-21807][SQL]The getAliasedConstraints function in LogicalPlan will take a long time when the number of expressions is greater than 100

## What changes were proposed in this pull request?
The getAliasedConstraints function in LogicalPlan.scala clones the expression set every time an element is added, which takes a long time. This PR adds a function that adds multiple elements at once to reduce the cloning cost.

Before the change, the cost of getAliasedConstraints is:
100 expressions: 41 seconds
150 expressions: 466 seconds

After the change, the cost of getAliasedConstraints is:
100 expressions: 1.8 seconds
150 expressions: 6.5 seconds

The test is like this:
test("getAliasedConstraints") {
val expressionNum = 150
val aggExpression = (1 to expressionNum).map(i => 
Alias(Count(Literal(1)), s"cnt$i")())
val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())

val beginTime = System.currentTimeMillis()
val expressions = aggPlan.validConstraints
println(s"validConstraints cost: ${System.currentTimeMillis() - 
beginTime}ms")
// The size of Aliased expression is n * (n - 1) / 2 + n
assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + 
expressionNum)
  }
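The effect of the change can be sketched with a plain immutable Set (ExpressionSet itself is more involved; names here are illustrative only):

    // Adding one element at a time rebuilds the set per element, which is what made
    // getAliasedConstraints slow; a single bulk ++ adds all elements at once, which is
    // what the proposed addMultiExpressions / ++ override does.
    val base = Set("a = 1", "b = 2")
    val toAdd = Seq("c = 3", "d = 4")

    val oneByOne = toAdd.foldLeft(base)((s, e) => s + e)  // one new set per element
    val bulk     = base ++ toAdd                          // single bulk add
    assert(oneByOne == bulk)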




## How was this patch tested?


Run the newly added test.



You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eatoncys/spark getAliasedConstraints

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19022


commit 80af01add6c0169c7cad0286afc748d845cd1327
Author: 10129659 <chen.yans...@zte.com.cn>
Date:   2017-08-22T07:47:27Z

The getAliasedConstraints function will take a long time when expression is 
greater than 100




---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-11 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132806724
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,14 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxLinesPerFunction")
+.internal()
+.doc("The maximum lines of a single Java function generated by 
whole-stage codegen. " +
+  "When the generated function exceeds this threshold, " +
+  "the whole-stage codegen is deactivated for this subtree of the 
current query plan.")
+.intConf
+.createWithDefault(1500)
--- End diff --

@kiszk, you're right, it depends on how much bytecode there is per line.
@gatorsmile, OK, let's take a conservative value of 2730 (8192 / 3) first.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132616342
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,14 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxLinesPerFunction")
+.internal()
+.doc("The maximum lines of a single Java function generated by 
whole-stage codegen. " +
+  "When the generated function exceeds this threshold, " +
+  "the whole-stage codegen is deactivated for this subtree of the 
current query plan.")
+.intConf
+.createWithDefault(1500)
--- End diff --

@gatorsmile, which do you think is better for the default value, 1500 or Int.Max?


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132616033
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -149,4 +150,56 @@ class WholeStageCodegenSuite extends SparkPlanTest 
with SharedSQLContext {
   assert(df.collect() === Array(Row(1), Row(2)))
 }
   }
+
+  def genGroupByCodeGenContext(caseNum: Int, maxLinesPerFunction: Int): 
CodegenContext = {
+val caseExp = (1 to caseNum).map { i =>
+  s"case when id > $i and id <= ${i + 1} then 1 else 0 end as v$i"
+}.toList
+
+spark.conf.set("spark.sql.codegen.maxLinesPerFunction", 
maxLinesPerFunction)
--- End diff --

Ok, modified, thanks.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132610861
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,14 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxLinesPerFunction")
+.internal()
+.doc("The maximum lines of a single Java function generated by 
whole-stage codegen. " +
+  "When the generated function exceeds this threshold, " +
+  "the whole-stage codegen is deactivated for this subtree of the 
current query plan.")
+.intConf
+.createWithDefault(1500)
--- End diff --

I think it applies to other Java programs using the Java HotSpot VM.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132610543
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -149,4 +149,75 @@ class WholeStageCodegenSuite extends SparkPlanTest 
with SharedSQLContext {
   assert(df.collect() === Array(Row(1), Row(2)))
 }
   }
+
+  test("SPARK-21603 check there is a too long generated function") {
+val ds = spark.range(10)
--- End diff --

OK, I have modified it as you suggested above. Would you like to review it again? Thanks.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132388819
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
   override def doExecute(): RDD[InternalRow] = {
 val (ctx, cleanedSource) = doCodeGen()
+if (ctx.isTooLongGeneratedFunction) {
+  logWarning("Found too long generated codes and JIT optimization 
might not work, " +
+"Whole-stage codegen disabled for this plan, " +
+"You can change the config spark.sql.codegen.MaxFunctionLength " +
+"to adjust the function length limit:\n "
++ s"$treeString")
+  return child.execute()
+}
--- End diff --

OK, I have added a test. Would you like to review it again? Thanks.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132376473
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
   override def doExecute(): RDD[InternalRow] = {
 val (ctx, cleanedSource) = doCodeGen()
+if (ctx.isTooLongGeneratedFunction) {
+  logWarning("Found too long generated codes and JIT optimization 
might not work, " +
+"Whole-stage codegen disabled for this plan, " +
+"You can change the config spark.sql.codegen.MaxFunctionLength " +
+"to adjust the function length limit:\n "
++ s"$treeString")
+  return child.execute()
+}
--- End diff --

When we check "ctx.isTooLongGeneratedFunction" in doExecute, the WholeStageCodegenExec node has already been generated, so there must be a WholeStageCodegenExec node at this point.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132374541
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
   override def doExecute(): RDD[InternalRow] = {
 val (ctx, cleanedSource) = doCodeGen()
+if (ctx.isTooLongGeneratedFunction) {
+  logWarning("Found too long generated codes and JIT optimization 
might not work, " +
+"Whole-stage codegen disabled for this plan, " +
+"You can change the config spark.sql.codegen.MaxFunctionLength " +
+"to adjust the function length limit:\n "
++ s"$treeString")
+  return child.execute()
+}
--- End diff --

@viirya, it is hard for me to check whether whole-stage codegen is disabled or not; would you like to give me some suggestions? Thanks.


---



[GitHub] spark pull request #18810: [SPARK-21603][SQL]The wholestage codegen will be ...

2017-08-10 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132370096
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala
 ---
@@ -89,6 +89,14 @@ object CodeFormatter {
 }
 new CodeAndComment(code.result().trim(), map)
   }
+
+  def stripExtraNewLinesAndComments(input: String): String = {
+val commentReg =
+  ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +   // strip /*comment*/
+"""([ |\t]*?\/\/[\s\S]*?\n)""").r   // strip //comment
--- End diff --

OK, modified, thanks.


---



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132368646
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,14 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxLinesPerFunction")
+.internal()
+.doc("The maximum lines of a single Java function generated by 
whole-stage codegen. " +
+  "When the generated function exceeds this threshold, " +
+  "the whole-stage codegen is deactivated for this subtree of the 
current query plan.")
+.intConf
+.createWithDefault(1500)
--- End diff --

When I modified it to 1600, the result is:
max function length of wholestagecodegen:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------
codegen = F                                        467 /  507          1.4         712.7       1.0X
codegen = T maxLinesPerFunction = 1600            3191 / 3238          0.2        4868.7       0.1X
codegen = T maxLinesPerFunction = 1500             449 /  482          1.5         685.2       1.0X
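
For anyone wanting to reproduce this outside the benchmark harness, a hypothetical sketch (it assumes a Spark build that contains the spark.sql.codegen.maxLinesPerFunction config from this PR; the query below is abbreviated, and the full case-when query from the benchmark is what makes the generated function long enough to cross the threshold):

    import org.apache.spark.sql.SparkSession

    object MaxLinesPerFunctionDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("maxLinesPerFunction-demo")
          .getOrCreate()
        spark.conf.set("spark.sql.codegen.wholeStage", "true")
        // 1600 lets the long benchmark function stay inside whole-stage codegen (slow, no JIT);
        // the default 1500 makes that plan fall back to the non-codegen path (fast).
        spark.conf.set("spark.sql.codegen.maxLinesPerFunction", "1600")
        spark.range(20 << 15)
          .selectExpr("id", "(id & 1023) as k1", "cast(id & 1023 as double) as k2")
          .groupBy("k1", "k2")
          .sum()
          .collect()
        spark.stop()
      }
    }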


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132368484
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
   override def doExecute(): RDD[InternalRow] = {
 val (ctx, cleanedSource) = doCodeGen()
+if (ctx.isTooLongGeneratedFunction) {
+  logWarning("Found too long generated codes and JIT optimization 
might not work, " +
+"Whole-stage codegen disabled for this plan, " +
+"You can change the config spark.sql.codegen.MaxFunctionLength " +
+"to adjust the function length limit:\n "
++ s"$treeString")
+  return child.execute()
+}
--- End diff --

I think it can be tested by the "max function length of wholestagecodegen" benchmark 
added in AggregateBenchmark.scala, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132365359
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,13 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxLinesPerFunction")
+.internal()
+.doc("The maximum lines of a function that will be supported before" +
+  " deactivating whole-stage codegen.")
--- End diff --

Ok, updated, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132365401
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,19 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns if there is a codegen function the lines of which is greater 
than maxLinesPerFunction
+   * It will count the lines of every codegen function, if there is a 
function of length
+   * greater than spark.sql.codegen.maxLinesPerFunction, it will return 
true.
+   */
+  def existTooLongFunction(): Boolean = {
--- End diff --

Ok, updated, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132365436
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,19 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns if there is a codegen function the lines of which is greater 
than maxLinesPerFunction
+   * It will count the lines of every codegen function, if there is a 
function of length
+   * greater than spark.sql.codegen.maxLinesPerFunction, it will return 
true.
+   */
+  def existTooLongFunction(): Boolean = {
+classFunctions.exists { case (className, functions) =>
+  functions.exists{ case (name, code) =>
+val codeWithoutComments = 
CodeFormatter.stripExtraNewLinesAndComments(code)
+codeWithoutComments.count(_ == '\n') > 
SQLConf.get.maxLinesPerFunction
+  }
+}
+  }
+  /**
--- End diff --

Ok, added, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132347436
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,18 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns if the length of codegen function is too long or not
--- End diff --

Ok, I have modified it, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132347148
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,18 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns if the length of codegen function is too long or not
+   * It will count the lines of every codegen function, if there is a 
function of length
+   * greater than spark.sql.codegen.MaxFunctionLength, it will return true.
+   */
+  def existTooLongFunction(): Boolean = {
+classFunctions.exists { case (className, functions) =>
+  functions.exists{ case (name, code) =>
+CodeFormatter.stripExtraNewLines(code).count(_ == '\n') > 
SQLConf.get.maxFunctionLength
--- End diff --

Ok, I have modified it to count lines without comments and extra new lines


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132347198
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,13 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val WHOLESTAGE_MAX_FUNCTION_LEN = 
buildConf("spark.sql.codegen.MaxFunctionLength")
--- End diff --

Ok, I have modified it, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-09 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r132347018
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
 ---
@@ -301,6 +301,61 @@ class AggregateBenchmark extends BenchmarkBase {
 */
   }
 
+  ignore("max function length of wholestagecodegen") {
+val N = 20 << 15
+
+val benchmark = new Benchmark("max function length of 
wholestagecodegen", N)
+def f(): Unit = sparkSession.range(N)
+  .selectExpr(
+"id",
+"(id & 1023) as k1",
+"cast(id & 1023 as double) as k2",
+"cast(id & 1023 as int) as k3",
+"case when id > 100 and id <= 200 then 1 else 0 end as v1",
+"case when id > 200 and id <= 300 then 1 else 0 end as v2",
+"case when id > 300 and id <= 400 then 1 else 0 end as v3",
+"case when id > 400 and id <= 500 then 1 else 0 end as v4",
+"case when id > 500 and id <= 600 then 1 else 0 end as v5",
+"case when id > 600 and id <= 700 then 1 else 0 end as v6",
+"case when id > 700 and id <= 800 then 1 else 0 end as v7",
+"case when id > 800 and id <= 900 then 1 else 0 end as v8",
+"case when id > 900 and id <= 1000 then 1 else 0 end as v9",
+"case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
+"case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
+"case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
+"case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
+"case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
+"case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
+"case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
+"case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
+"case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
+  .groupBy("k1", "k2", "k3")
+  .sum()
+  .collect()
+
+benchmark.addCase(s"codegen = F") { iter =>
+  sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
+  f()
+}
+
+benchmark.addCase(s"codegen = T") { iter =>
+  sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
+  sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "1")
--- End diff --

Ok, I have added a test using the default value of 1500, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-08 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18810
  
cc @gatorsmile 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-07 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131585903
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,16 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns the length of codegen function  is too long or not
+   */
+  def existTooLongFunction(): Boolean = {
--- End diff --

Ok, I have modified it, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-07 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131593593
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -370,6 +370,12 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
   override def doExecute(): RDD[InternalRow] = {
 val (ctx, cleanedSource) = doCodeGen()
+val existLongFunction = ctx.existTooLongFunction
+if (existLongFunction) {
+  logWarning(s"Function is too long, Whole-stage codegen disabled for 
this plan:\n "
++ s"$treeString")
--- End diff --

@gatorsmile, thank you for the review. The treeString does not contain the code; 
it only contains the tree string of the physical plan, like below:
*HashAggregate(keys=[k1#2395L, k2#2396, k3#2397], functions=[partial_sum(id#2392L)...
+- *Project [id#2392L, (id#2392L & 1023) AS k1#2395L, cast((id#2392L & 1023) as double) AS k2#2396...
   +- *Range (0, 655360, step=1, splits=1)
So I think it will not be very big.
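
For reference, a quick way to see what such a tree string looks like (a sketch only; the object name is illustrative and the query mirrors the benchmark):

    import org.apache.spark.sql.SparkSession

    object TreeStringDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").appName("treeString-demo").getOrCreate()
        val df = spark.range(0, 655360, 1, 1)
          .selectExpr("id", "(id & 1023) as k1", "cast(id & 1023 as double) as k2", "cast(id & 1023 as int) as k3")
          .groupBy("k1", "k2", "k3")
          .sum("id")
        // Only the physical plan tree is printed, not the generated Java source,
        // so the logged warning stays small.
        println(df.queryExecution.executedPlan.treeString)
        spark.stop()
      }
    }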


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-07 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131585857
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,18 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns the length of codegen function  is too long or not
--- End diff --

Ok, I have modified it, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-04 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131340166
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -356,6 +356,16 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns the length of codegen function  is too long or not
+   */
+  def existTooLongFunction(): Boolean = {
+classFunctions.exists { case (className, functions) =>
+  functions.exists{ case (name, code) =>
+CodeFormatter.stripExtraNewLines(code).count(_ == '\n') > 
SQLConf.get.maxFunctionLength
--- End diff --

@kiszk Because when the JVM parameter -XX:+DontCompileHugeMethods is enabled (it is on by 
default), a method whose bytecode is longer than 8000 bytes does not get JIT optimization; 
here I just estimate the function-line limit from that 8000-byte bound. Maybe there 
are some other, better ways.
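
Not part of the PR, but a small sketch of how one might confirm this JVM setting (it assumes a HotSpot JVM where the com.sun.management diagnostic bean is available; the object name is illustrative):

    import java.lang.management.ManagementFactory
    import com.sun.management.HotSpotDiagnosticMXBean

    object JitHugeMethodCheck {
      def main(args: Array[String]): Unit = {
        val diag = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
        // DontCompileHugeMethods defaults to true, so methods larger than the HotSpot
        // bytecode limit (8000 bytes by default) are never JIT-compiled.
        val opt = diag.getVMOption("DontCompileHugeMethods")
        println(s"${opt.getName} = ${opt.getValue}")
      }
    }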


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-01 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/18810

[SPARK-21603][sql]The wholestage codegen will be much slower then 
wholestage codegen is closed when the function is too long

## What changes were proposed in this pull request?
Disable whole-stage codegen when a generated function has more lines than the limit set by the
spark.sql.codegen.MaxFunctionLength parameter, because when the function is
too long it will not get JIT optimization.
A benchmark shows it is about 10x slower when the generated function is too
long:

ignore("max function length of wholestagecodegen") {
val N = 20 << 15

val benchmark = new Benchmark("max function length of 
wholestagecodegen", N)
def f(): Unit = sparkSession.range(N)
  .selectExpr(
"id",
"(id & 1023) as k1",
"cast(id & 1023 as double) as k2",
"cast(id & 1023 as int) as k3",
"case when id > 100 and id <= 200 then 1 else 0 end as v1",
"case when id > 200 and id <= 300 then 1 else 0 end as v2",
"case when id > 300 and id <= 400 then 1 else 0 end as v3",
"case when id > 400 and id <= 500 then 1 else 0 end as v4",
"case when id > 500 and id <= 600 then 1 else 0 end as v5",
"case when id > 600 and id <= 700 then 1 else 0 end as v6",
"case when id > 700 and id <= 800 then 1 else 0 end as v7",
"case when id > 800 and id <= 900 then 1 else 0 end as v8",
"case when id > 900 and id <= 1000 then 1 else 0 end as v9",
"case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
"case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
"case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
"case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
"case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
"case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
"case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
"case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
"case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
  .groupBy("k1", "k2", "k3")
  .sum()
  .collect()

benchmark.addCase(s"codegen = F") { iter =>
  sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
  f()
}

benchmark.addCase(s"codegen = T") { iter =>
  sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
  sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "1")
  f()
}

benchmark.run()

/*
Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
max function length of wholestagecodegen:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------
codegen = F                                        443 /  507          1.5         676.0       1.0X
codegen = T                                       3279 / 3283          0.2        5002.6       0.1X
 */
  }


## How was this patch tested?
Run the unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark codegen

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18810


commit ca9eff68424511fa11cc2bd695f1fddaae178e3c
Author: 10129659 <chen.yans...@zte.com.cn>
Date:   2017-08-02T03:48:21Z

The wholestage codegen will be slower when the function is too long

commit 1b0ac5ed896136df3579a61d7ef93980c0647e97
Author: 10129659 <chen.yans...@zte.com.cn>
Date:   2017-08-02T04:41:24Z

The wholestage codegen will be slower when the function is too long




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-22 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123475601
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +543,30 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.cores.max")) {
+  val totalCores = getInt("spark.cores.max", -1)
+  if (totalCores <= 0) {
+throw new IllegalArgumentException(s"spark.cores.max (was 
${get("spark.cores.max")})" +
+  s" can only be a positive number")
+  }
+}
+if (contains("spark.executor.cores")) {
+  val executorCores = getInt("spark.executor.cores", -1)
+  if (executorCores <= 0) {
+throw new IllegalArgumentException(s"spark.executor.cores " +
+  s"(was ${get("spark.executor.cores")}) can only be a positive 
number")
+  }
+}
+if (contains("spark.cores.max") && contains("spark.executor.cores")) {
+  val totalCores = getInt("spark.cores.max", 1)
--- End diff --

@srowen @jiangxb1987 Ok, I have removed the argument checking code, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-22 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123461583
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +543,30 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.cores.max")) {
+  val totalCores = getInt("spark.cores.max", -1)
+  if (totalCores <= 0) {
+throw new IllegalArgumentException(s"spark.cores.max (was 
${get("spark.cores.max")})" +
+  s" can only be a positive number")
+  }
+}
+if (contains("spark.executor.cores")) {
+  val executorCores = getInt("spark.executor.cores", -1)
+  if (executorCores <= 0) {
+throw new IllegalArgumentException(s"spark.executor.cores " +
+  s"(was ${get("spark.executor.cores")}) can only be a positive 
number")
+  }
+}
+if (contains("spark.cores.max") && contains("spark.executor.cores")) {
+  val totalCores = getInt("spark.cores.max", 1)
--- End diff --

@srowen Users may set these configurations via the SparkConf object in a 
programmatic way, after the arg checking done for spark-submit. For example, an app 
may contain code like `val conf = new SparkConf().set("spark.executor.cores", "-1")` 
before the SparkContext is created; the checks here can catch the error and exit 
directly.
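
For illustration, a sketch of that programmatic case (the object name is illustrative; with the checks proposed in this PR, the SparkContext constructor is expected to reject the value):

    import org.apache.spark.{SparkConf, SparkContext}

    object InvalidCoresExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("invalid-cores-demo")
          .set("spark.executor.cores", "-1") // bypasses spark-submit's arg checking
        try {
          val sc = new SparkContext(conf) // with the new checks, validateSettings should throw
          sc.stop()
        } catch {
          case e: IllegalArgumentException => println(s"Rejected config: ${e.getMessage}")
        }
      }
    }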


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-22 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
cc @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420930
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -258,23 +256,7 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (mainClass == null && SparkSubmit.isUserJar(primaryResource)) {
   SparkSubmit.printErrorAndExit("No main class set in JAR; please 
specify one with --class")
 }
-if (driverMemory != null
-&& Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) 
<= 0) {
-  SparkSubmit.printErrorAndExit("Driver Memory must be a positive 
number")
-}
-if (executorMemory != null
-&& Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) 
<= 0) {
-  SparkSubmit.printErrorAndExit("Executor Memory cores must be a 
positive number")
-}
-if (executorCores != null && Try(executorCores.toInt).getOrElse(-1) <= 
0) {
-  SparkSubmit.printErrorAndExit("Executor cores must be a positive 
number")
-}
-if (totalExecutorCores != null && 
Try(totalExecutorCores.toInt).getOrElse(-1) <= 0) {
-  SparkSubmit.printErrorAndExit("Total executor cores must be a 
positive number")
-}
-if (numExecutors != null && Try(numExecutors.toInt).getOrElse(-1) <= 
0) {
-  SparkSubmit.printErrorAndExit("Number of executors must be a 
positive number")
-}
--- End diff --

@jerryshao Ok, I have moved them back, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420644
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +545,42 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.driver.memory")) {
+  val driverMemory = get("spark.driver.memory")
+  if (Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 
0) {
+throw new IllegalArgumentException(s"spark.driver.memory " +
+  s"(was ${driverMemory}) can only be a positive number")
+  }
+}
+if (contains("spark.executor.memory")) {
+  val executorMemory = get("spark.executor.memory")
+  if (Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) 
<= 0) {
+throw new IllegalArgumentException(s"spark.executor.memory " +
+  s"(was ${executorMemory}) can only be a positive number")
+  }
--- End diff --

@jerryshao Ok, I have moved them back, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420083
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala ---
@@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite
   private def getState(master: Master): RecoveryState.Value = {
 master.invokePrivate(_state())
   }
+
+  test("Total cores is not divisible by cores per executor") {
--- End diff --

@jerryshao Ok, I have removed them, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18351: [SPARK-21135][WEB UI] On history server page,duration ...

2017-06-20 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18351
  
I think it is better to hide it. @fjh100456 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122942651
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +543,30 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.cores.max")) {
+  val totalCores = getInt("spark.cores.max", -1)
+  if (totalCores <= 0) {
+throw new IllegalArgumentException(s"spark.cores.max (was 
${get("spark.cores.max")})" +
+  s" can only be a positive number")
+  }
+}
+if (contains("spark.executor.cores")) {
+  val executorCores = getInt("spark.executor.cores", -1)
+  if (executorCores <= 0) {
+throw new IllegalArgumentException(s"spark.executor.cores " +
+  s"(was ${get("spark.executor.cores")}) can only be a positive 
number")
+  }
+}
+if (contains("spark.cores.max") && contains("spark.executor.cores")) {
--- End diff --

@jerryshao I put the negative checks here first, but I think the app should exit 
directly if the cores value is negative, so I moved them out. And @srowen thinks 
these checks for negative numbers are redundant with the arg checking for 
spark-submit; it may be a good way to move the checks from spark-submit to 
here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122939834
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala ---
@@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite
   private def getState(master: Master): RecoveryState.Value = {
 master.invokePrivate(_state())
   }
+
+  test("Total cores is not divisible by cores per executor") {
--- End diff --

@jerryshao I do not have any good way to test this; any suggestions? 
@srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122926099
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +543,30 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.cores.max")) {
+  val totalCores = getInt("spark.cores.max", -1)
--- End diff --

@jerryshao I don't understand this very clearly; if we don't set this 
configuration, the `if (contains("spark.cores.max"))` branch will not be entered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-20 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122925526
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala ---
@@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite
   private def getState(master: Master): RecoveryState.Value = {
 master.invokePrivate(_state())
   }
+
+  test("Total cores is not divisible by cores per executor") {
--- End diff --

@jerryshao The result is the same before and after my change; how can I test them 
differently? Any suggestions? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-19 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122676148
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +543,30 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.cores.max")) {
--- End diff --

@srowen Users may set these configurations via the SparkConf object in a 
programmatic way; should I move the checks from spark-submit to here, or 
remove them here directly? Which is better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-19 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao, I have added a unit test in MasterSuite, would you like to 
review it again, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-19 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122638913
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -278,6 +278,14 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (pyFiles != null && !isPython) {
   SparkSubmit.printErrorAndExit("--py-files given but primary resource 
is not a Python script")
 }
+if (totalExecutorCores != null && executorCores != null) {
--- End diff --

Ok, I have moved it to SparkConf, would you like to review it again, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-17 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122562905
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -658,19 +658,22 @@ private[deploy] class Master(
   private def startExecutorsOnWorkers(): Unit = {
 // Right now this is a very simple FIFO scheduler. We keep trying to 
fit in the first app
 // in the queue, then the second app, etc.
-for (app <- waitingApps if app.coresLeft > 0) {
+for (app <- waitingApps) {
   val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
-  // Filter out workers that don't have enough resources to launch an 
executor
-  val usableWorkers = workers.toArray.filter(_.state == 
WorkerState.ALIVE)
-.filter(worker => worker.memoryFree >= 
app.desc.memoryPerExecutorMB &&
-  worker.coresFree >= coresPerExecutor.getOrElse(1))
-.sortBy(_.coresFree).reverse
-  val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, 
spreadOutApps)
-
-  // Now that we've decided how many cores to allocate on each worker, 
let's allocate them
-  for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
-allocateWorkerResourceToExecutors(
-  app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
+  // If the cores left is less than the coresPerExecutor,the cores 
left will not be allocated
+  if (app.coresLeft >= coresPerExecutor.getOrElse(1)) {
--- End diff --

Ok, I've modified the expression "val coresPerExecutor = 
app.desc.coresPerExecutor" to "val coresPerExecutor = 
app.desc.coresPerExecutor.getOrElse(1)" and reused it, and I now use 
"app.desc.coresPerExecutor" directly in the function 
allocateWorkerResourceToExecutors. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-17 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122562797
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -278,6 +278,12 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (pyFiles != null && !isPython) {
   SparkSubmit.printErrorAndExit("--py-files given but primary resource 
is not a Python script")
 }
+if (totalExecutorCores != null && executorCores != null
+&& (totalExecutorCores.toInt % executorCores.toInt) != 0) {
--- End diff --

Ok, I've extracted the repeated mod expression into a val and reused it, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-16 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122439758
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -658,19 +658,22 @@ private[deploy] class Master(
   private def startExecutorsOnWorkers(): Unit = {
 // Right now this is a very simple FIFO scheduler. We keep trying to 
fit in the first app
 // in the queue, then the second app, etc.
-for (app <- waitingApps if app.coresLeft > 0) {
+for (app <- waitingApps) {
--- End diff --

@srowen If the total cores are not divisible by the cores per executor, the check 
app.coresLeft > 0 will always be true, so it is better to compare 
app.coresLeft with coresPerExecutor than with 0.
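
A tiny worked example in Scala with the numbers from this PR (--total-executor-cores=4, spark.executor.cores=3); the object name is only illustrative:

    object CoresLeftExample extends App {
      val coresPerExecutor = 3
      val coresLeft = 4 - coresPerExecutor          // one executor is allocated, 1 core remains
      val oldGuard = coresLeft > 0                  // true forever: the app is re-examined in every schedule
      val newGuard = coresLeft >= coresPerExecutor  // false: the app is skipped once nothing more can fit
      println(s"old guard = $oldGuard, proposed guard = $newGuard")
    }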


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-16 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122424679
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -658,19 +658,22 @@ private[deploy] class Master(
   private def startExecutorsOnWorkers(): Unit = {
 // Right now this is a very simple FIFO scheduler. We keep trying to 
fit in the first app
 // in the queue, then the second app, etc.
-for (app <- waitingApps if app.coresLeft > 0) {
+for (app <- waitingApps) {
--- End diff --

Would it not be better to compare app.coresLeft with coresPerExecutor? If the 
coresLeft is less than coresPerExecutor, it will return directly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-16 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r122424299
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -278,6 +278,15 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (pyFiles != null && !isPython) {
   SparkSubmit.printErrorAndExit("--py-files given but primary resource 
is not a Python script")
 }
+if (totalExecutorCores != null && executorCores != null) {
+  val totalCores = Try(totalExecutorCores.toInt).getOrElse(-1)
--- End diff --

Ok, I will remove the Try block, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-16 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao I have added warning logs in SparkSubmit; would you like to 
review it again? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-16 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao Ok, I will add warning logs in SparkSubmit, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-16 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao I have modified the "app.coresLeft>0" to "app.coresLeft >= 
coresPerExecutor.getOrElse(1)".
And another question: would it be better to allocate another 
executor with 1 core for the cores left?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-16 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao The problem is: if we start an app with 
--total-executor-cores=4 and spark.executor.cores=3, the check 
"app.coresLeft > 0" is always true in 
"org.apache.spark.deploy.master.startExecutorsOnWorkers", so it will try to 
allocate an executor for this app in every schedule but allocate nothing. It is better to 
compare app.coresLeft with coresPerExecutor; if the coresLeft is less than 
coresPerExecutor, it will return directly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-16 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jerryshao I have not seen any issue here, and I have tested this again 
using the latest master code; the problem still exists.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-16 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/18322

[SPARK-21115][Core]If the cores left is less than the coresPerExecutor,the 
cores left will not be allocated, so it should not to check in every schedule

## What changes were proposed in this pull request?
If we start an app with the params --total-executor-cores=4 and 
spark.executor.cores=3, the cores left is always 1, so the master will try to allocate 
executors in the function 
org.apache.spark.deploy.master.startExecutorsOnWorkers in every schedule, but nothing can be allocated.
Another question is: would it be better to allocate another executor with 
1 core for the cores left?



## How was this patch tested?
unit test




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark leftcores

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18322


commit 5f5f01fb55ec827d0053de2b574929520ff73406
Author: 10129659 <chen.yans...@zte.com.cn>
Date:   2017-06-16T05:46:47Z

If the cores left is less than the coresPerExecutor,the cores left will not 
be allocated, so it should not to check in every schedule




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


