[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19256#discussion_r139302993
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala ---
@@ -41,4 +41,8 @@ private class AggregatedDialect(dialects: 
List[JdbcDialect]) extends JdbcDialect
   override def getJDBCType(dt: DataType): Option[JdbcType] = {
     dialects.flatMap(_.getJDBCType(dt)).headOption
   }
+
+  override def isCascadingTruncateTable(): Option[Boolean] = {
+    dialects.flatMap(_.isCascadingTruncateTable).headOption
--- End diff --

Why use the first one?
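For context, `dialects.flatMap(_.isCascadingTruncateTable).headOption` answers with the first dialect that returns a defined value. A tiny sketch of how that policy can differ from an "any dialect cascades" policy — in Python for brevity, with made-up values:

```python
# Each dialect answers Some(true) / Some(false) / None ("don't know");
# model that here as True / False / None.
opinions = [None, False, True]

known = [o for o in opinions if o is not None]

# Policy in the patch (first defined answer wins):
first_known = known[0] if known else None      # -> False

# A more conservative alternative (cascade if ANY dialect says so):
any_cascades = any(known) if known else None   # -> True

print(first_known, any_cascades)
```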


---




[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19256
  
**[Test build #81848 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81848/testReport)**
 for PR 19256 at commit 
[`7b9d6fc`](https://github.com/apache/spark/commit/7b9d6fcec61cd72ffb257abf17eb0e9d6c462f9e).


---




[GitHub] spark issue #16600: [SPARK-19242][SQL] SHOW CREATE TABLE should generate new...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16600
  
**[Test build #81849 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81849/testReport)**
 for PR 16600 at commit 
[`dfb25a1`](https://github.com/apache/spark/commit/dfb25a15ee437dd6c09d840efdcaf9d233cb8dd0).


---




[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19256
  
ok to test


---




[GitHub] spark pull request #19253: [SPARK-22037][SQL] Collapse Project if it is the ...

2017-09-16 Thread gengliangwang
Github user gengliangwang closed the pull request at:

https://github.com/apache/spark/pull/19253


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/19253
  
@gatorsmile @viirya  Thanks. I should close this PR.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19218
  
@fjh100456 There is a priority order among three different inputs, and here you only consider one of the three. Please also add the extra checks. Hopefully @dongjoon-hyun can help answer your questions; he just finished related work in 
https://github.com/apache/spark/pull/19055


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r139302763
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -101,6 +101,19 @@ case class InsertIntoHiveTable(
 val tmpLocation = getExternalTmpPath(sparkSession, hadoopConf, 
tableLocation)
 val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, 
false)
 
+tableDesc.getOutputFileFormatClassName match {
--- End diff --

Move the whole logic into `saveAsHiveFile`, which is shared by `InsertIntoHiveDirCommand` and `InsertIntoHiveTable`. Both need this logic. 


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19218
  
I see. If you set `spark.sql.hive.convertMetastoreParquet` to false, you will also hit the issue for non-partitioned tables. 


---




[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-09-16 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r139302558
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -503,69 +504,307 @@ case class FindInSet(left: Expression, right: 
Expression) extends BinaryExpressi
   override def prettyName: String = "find_in_set"
 }
 
+trait String2TrimExpression extends Expression with ImplicitCastInputTypes 
{
+
+  override def dataType: DataType = StringType
+  override def inputTypes: Seq[AbstractDataType] = 
Seq.fill(children.size)(StringType)
+
+  override def nullable: Boolean = children.exists(_.nullable)
+  override def foldable: Boolean = children.forall(_.foldable)
+}
+
+object StringTrim {
+  def apply(str: Expression, trimStr: Expression) : StringTrim = 
StringTrim(str, Some(trimStr))
+  def apply(str: Expression) : StringTrim = StringTrim(str, None)
+}
+
 /**
- * A function that trim the spaces from both ends for the specified string.
+ * A function that takes a character string, removes the leading and 
trailing characters matching with the characters
+ * in the trim string, returns the new string.
+ * If BOTH and trimStr keywords are not specified, it defaults to remove 
space character from both ends. The trim
+ * function will have one argument, which contains the source string.
+ * If BOTH and trimStr keywords are specified, it trims the characters 
from both ends, and the trim function will have
+ * two arguments, the first argument contains trimStr, the second argument 
contains the source string.
+ * trimStr: A character string to be trimmed from the source string, if it 
has multiple characters, the function
+ * searches for each character in the source string, removes the 
characters from the source string until it
+ * encounters the first non-match character.
+ * BOTH: removes any character from both ends of the source string that 
matches characters in the trim string.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str) - Removes the leading and trailing space characters 
from `str`.",
+  usage = """
+_FUNC_(str) - Removes the leading and trailing space characters from 
`str`.
+_FUNC_(BOTH trimStr FROM str) - Remove the leading and trailing 
trimString from `str`
+  """,
+  arguments = """
+Arguments:
+  * str - a string expression
+  * trimString - the trim string
+  * BOTH, FROM - these are keyword to specify for trim string from 
both ends of the string
+  """,
   examples = """
 Examples:
   > SELECT _FUNC_('SparkSQL   ');
SparkSQL
+  > SELECT _FUNC_(BOTH 'SL' FROM 'SSparkSQLS');
+   parkSQ
   """)
-case class StringTrim(child: Expression)
-  extends UnaryExpression with String2StringExpression {
+case class StringTrim(
+srcStr: Expression,
+trimStr: Option[Expression] = None)
+  extends String2TrimExpression {
 
-  def convert(v: UTF8String): UTF8String = v.trim()
+  def this (trimStr: Expression, srcStr: Expression) = this(srcStr, 
Option(trimStr))
--- End diff --

done.


---




[GitHub] spark issue #19258: add MockNetCat

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19258
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19258: add MockNetCat

2017-09-16 Thread bluejoe2008
GitHub user bluejoe2008 opened a pull request:

https://github.com/apache/spark/pull/19258

add MockNetCat

## What changes were proposed in this pull request?
I added a MockNetCat class, which avoids having to manually launch the 'nc -lk ' command for tests.
I also added a MockNetCatTest class, a JUnit 4 test case that tests MockNetCat.

## How was this patch tested?

Run MockNetCatTest to test MockNetCat.
The test uses a structured streaming example that reads socket data from 
localhost:

Please review http://spark.apache.org/contributing.html before opening a 
pull request.
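For readers unfamiliar with the pattern, a rough sketch of the idea behind such a mock — written in Python for brevity; the PR's actual class is a JVM class, and its name and API may differ:

```python
import socket
import threading

def mock_netcat(port, lines):
    """Rough stand-in for `nc -lk <port>`: serve the given lines to every
    client that connects, so streaming tests need no external process."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", port))
    server.listen(1)

    def serve():
        while True:
            conn, _ = server.accept()
            try:
                for line in lines:
                    conn.sendall((line + "\n").encode("utf-8"))
            finally:
                conn.close()

    threading.Thread(target=serve, daemon=True).start()  # Python 3 kwarg
    return server
```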


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bluejoe2008/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19258


commit f61636da8998c52ddb980584ed8a9e7aee01902c
Author: bluejoe2...@gmail.com 
Date:   2017-09-17T05:14:37Z

add MockNetCat




---




[GitHub] spark pull request #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19249#discussion_r139301428
  
--- Diff: python/pyspark/sql/types.py ---
@@ -619,7 +621,8 @@ def fromInternal(self, obj):
 # it's already converted by pickler
 return obj
 if self._needSerializeAnyField:
-values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+values = [f.fromInternal(v) if n else v
--- End diff --

Could we run a benchmark for the worst case, when all columns need to be converted? I think here we basically pay an extra `if` and an extra element in the zip to avoid a function call. This looks okay in practice, but I guess we should also identify the downside.

Also, let's add a comment here describing what we are doing, and add some links to this PR so that others can refer to the benchmarks.
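A minimal, standalone sketch of the kind of worst-case micro-benchmark being requested — the converters and sizes below are made up for illustration and are not PySpark's actual code:

```python
import timeit

# Hypothetical stand-ins for per-field fromInternal converters;
# worst case: every field needs conversion.
converters = [lambda v, i=i: v + i for i in range(20)]
needs_conversion = [True] * len(converters)
row = list(range(20))

def before():
    # Old code path: unconditionally call the converter.
    return [f(v) for f, v in zip(converters, row)]

def after():
    # Patched code path: an extra flag in the zip plus a conditional.
    return [f(v) if n else v
            for f, v, n in zip(converters, row, needs_conversion)]

print("before:", timeit.timeit(before, number=100000))
print("after: ", timeit.timeit(after, number=100000))
```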


---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19257
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81847/
Test PASSed.


---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19257
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19257
  
**[Test build #81847 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81847/testReport)**
 for PR 19257 at commit 
[`6ff4ed0`](https://github.com/apache/spark/commit/6ff4ed089777d5ec9edef8e7e53e847ec9379444).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19253
  
The removed `Project` might affect the rule `PhysicalOperation` and reduce 
the chance to prune columns when reading relations.


---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19257
  
**[Test build #81847 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81847/testReport)**
 for PR 19257 at commit 
[`6ff4ed0`](https://github.com/apache/spark/commit/6ff4ed089777d5ec9edef8e7e53e847ec9379444).


---




[GitHub] spark pull request #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can bre...

2017-09-16 Thread tejasapatil
GitHub user tejasapatil opened a pull request:

https://github.com/apache/spark/pull/19257

[SPARK-22042] [SQL] ReorderJoinPredicates can break when child's 
partitioning is not decided

## What changes were proposed in this pull request?

See jira description for the bug : 
https://issues.apache.org/jira/browse/SPARK-22042

The fix in this PR: in `EnsureRequirements`, apply `ReorderJoinPredicates` over the input tree before doing its core logic. Since the tree is transformed bottom-up, we can ensure that the children are resolved before doing `ReorderJoinPredicates`. 

Theoretically this guarantees covering all such cases while keeping the code simple. My only small gripe is cosmetic: this PR will look odd given that we don't (to my knowledge) call rules from other rules. I could have moved all the logic for `ReorderJoinPredicates` into `EnsureRequirements`, but that would make it a bit crowded. I am happy to discuss if there are better options. 

## How was this patch tested?

Added a new test case

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark 
SPARK-22042_ReorderJoinPredicates

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19257


commit 6ff4ed089777d5ec9edef8e7e53e847ec9379444
Author: Tejas Patil 
Date:   2017-09-17T01:10:31Z

[SPARK-22042] [SQL] ReorderJoinPredicates can break when child's 
partitioning is not decided




---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-09-16 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/19257
  
Jenkins test this please


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19255
  
This PR also needs perf tests, with the results in the description.


---




[GitHub] spark pull request #19255: [WIP][SPARK-22029] Add lru_cache to _parse_dataty...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19255#discussion_r139298764
  
--- Diff: python/pyspark/sql/types.py ---
@@ -24,6 +24,7 @@
 import re
 import base64
 from array import array
+from functools import lru_cache
--- End diff --

I think we should disable it in Python 2.x if we are going ahead with this.
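A sketch of one way to do that, assuming a no-op fallback decorator is acceptable (the `functools32` backport would be another option):

```python
try:
    from functools import lru_cache  # Python 3.2+
except ImportError:
    # Python 2.x has no functools.lru_cache; fall back to a no-op
    # decorator, which simply disables the caching.
    def lru_cache(maxsize=None):
        def decorator(f):
            return f
        return decorator
```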


---




[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19256
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...

2017-09-16 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/19256

[SPARK-21338][SQL]implement isCascadingTruncateTable() method in AggregatedDialect

## What changes were proposed in this pull request?

org.apache.spark.sql.jdbc.JdbcDialect's method:
def isCascadingTruncateTable(): Option[Boolean] = None
is not overridden in the org.apache.spark.sql.jdbc.AggregatedDialect class.
Because of this, when you add more than one dialect, Spark doesn't truncate the table, because isCascadingTruncateTable always returns the default None for the aggregated dialect.
This PR implements isCascadingTruncateTable in the AggregatedDialect class.

## How was this patch tested?

In JDBCSuite, inside test("Aggregated dialects"), one line is added to test 
AggregatedDialect.isCascadingTruncateTable.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-21338

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19256.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19256






---




[GitHub] spark issue #18853: [SPARK-21646][SQL] CommonType for binary comparison

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18853
  
Thank you for your investigation! I think we need to introduce a type 
inference conf for it. To avoid impacting the existing Spark users, we should 
keep the existing behaviors. 


---




[GitHub] spark issue #19246: [SPARK-22025] Speeding up fromInternal for StructField

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19246
  
Could you mark [PySpark] in the title? cc @ueshin


---




[GitHub] spark pull request #19253: [SPARK-22037][SQL] Collapse Project if it is the ...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19253#discussion_r139297437
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -541,6 +541,25 @@ object CollapseProject extends Rule[LogicalPlan] {
 agg.copy(aggregateExpressions = buildCleanedProjectList(
   p.projectList, agg.aggregateExpressions))
   }
+case agg @ Aggregate(groupingExpressions, aggregateExpressions, p: 
Project) =>
+  if (haveCommonNonDeterministicOutput(groupingExpressions, 
p.projectList) ||
+haveCommonNonDeterministicOutput(aggregateExpressions, 
p.projectList)) {
--- End diff --

This condition is not right


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19253
  
With whole-stage codegen, removing the useless Project will not bring any perf gain anyway.


---




[GitHub] spark pull request #19253: [SPARK-22037][SQL] Collapse Project if it is the ...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19253#discussion_r139297406
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -541,6 +541,25 @@ object CollapseProject extends Rule[LogicalPlan] {
 agg.copy(aggregateExpressions = buildCleanedProjectList(
   p.projectList, agg.aggregateExpressions))
   }
+case agg @ Aggregate(groupingExpressions, aggregateExpressions, p: 
Project) =>
+  if (haveCommonNonDeterministicOutput(groupingExpressions, 
p.projectList) ||
+haveCommonNonDeterministicOutput(aggregateExpressions, 
p.projectList)) {
--- End diff --

This condition is not right.


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19255
  
Could you mark [PySpark] in the title? cc @ueshin 


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19249
  
Could you mark [PySpark] in the title? cc @ueshin 


---




[GitHub] spark issue #19234: [SPARK-22010] Change fromInternal method of TimestampTyp...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19234
  
Could you mark [PySpark] in the title? cc @ueshin 


---




[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r139297269
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -503,69 +504,307 @@ case class FindInSet(left: Expression, right: 
Expression) extends BinaryExpressi
   override def prettyName: String = "find_in_set"
 }
 
+trait String2TrimExpression extends Expression with ImplicitCastInputTypes 
{
+
+  override def dataType: DataType = StringType
+  override def inputTypes: Seq[AbstractDataType] = 
Seq.fill(children.size)(StringType)
+
+  override def nullable: Boolean = children.exists(_.nullable)
+  override def foldable: Boolean = children.forall(_.foldable)
+}
+
+object StringTrim {
+  def apply(str: Expression, trimStr: Expression) : StringTrim = 
StringTrim(str, Some(trimStr))
+  def apply(str: Expression) : StringTrim = StringTrim(str, None)
+}
+
 /**
- * A function that trim the spaces from both ends for the specified string.
+ * A function that takes a character string, removes the leading and 
trailing characters matching with the characters
+ * in the trim string, returns the new string.
+ * If BOTH and trimStr keywords are not specified, it defaults to remove 
space character from both ends. The trim
+ * function will have one argument, which contains the source string.
+ * If BOTH and trimStr keywords are specified, it trims the characters 
from both ends, and the trim function will have
+ * two arguments, the first argument contains trimStr, the second argument 
contains the source string.
+ * trimStr: A character string to be trimmed from the source string, if it 
has multiple characters, the function
+ * searches for each character in the source string, removes the 
characters from the source string until it
+ * encounters the first non-match character.
+ * BOTH: removes any character from both ends of the source string that 
matches characters in the trim string.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str) - Removes the leading and trailing space characters 
from `str`.",
+  usage = """
+_FUNC_(str) - Removes the leading and trailing space characters from 
`str`.
+_FUNC_(BOTH trimStr FROM str) - Remove the leading and trailing 
trimString from `str`
+  """,
+  arguments = """
+Arguments:
+  * str - a string expression
+  * trimString - the trim string
+  * BOTH, FROM - these are keyword to specify for trim string from 
both ends of the string
+  """,
   examples = """
 Examples:
   > SELECT _FUNC_('SparkSQL   ');
SparkSQL
+  > SELECT _FUNC_(BOTH 'SL' FROM 'SSparkSQLS');
+   parkSQ
   """)
-case class StringTrim(child: Expression)
-  extends UnaryExpression with String2StringExpression {
+case class StringTrim(
+srcStr: Expression,
+trimStr: Option[Expression] = None)
+  extends String2TrimExpression {
 
-  def convert(v: UTF8String): UTF8String = v.trim()
+  def this (trimStr: Expression, srcStr: Expression) = this(srcStr, 
Option(trimStr))
+
+  def this(srcStr: Expression) = this(srcStr, None)
 
   override def prettyName: String = "trim"
 
+  override def children: Seq[Expression] = if (trimStr.isDefined) {
+srcStr :: trimStr.get :: Nil
+  } else {
+srcStr :: Nil
+  }
+  override def eval(input: InternalRow): Any = {
+val srcString = srcStr.eval(input).asInstanceOf[UTF8String]
+if (srcString == null) {
+  null
+} else {
+  if (trimStr.isDefined) {
+return 
srcString.trim(trimStr.get.eval(input).asInstanceOf[UTF8String])
+  } else {
+return srcString.trim()
+  }
+}
+  }
+
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-defineCodeGen(ctx, ev, c => s"($c).trim()")
+val evals = children.map(_.genCode(ctx))
+val srcString = evals(0)
+
+if (evals.length == 1) {
+  ev.copy(evals.map(_.code).mkString("\n") + s"""
--- End diff --

sounds good 


---




[GitHub] spark issue #19254: [MINOR][CORE] Cleanup dead code and duplication in Mem. ...

2017-09-16 Thread original-brownbear
Github user original-brownbear commented on the issue:

https://github.com/apache/spark/pull/19254
  
The test failure in PySpark appears unrelated; various tests OOMed, e.g.:

```sh

FAILED (errors=2)
[Running ]
ERROR

==
ERROR: setUpClass (pyspark.streaming.tests.StreamingListenerTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/streaming/tests.py",
 line 65, in setUpClass
cls.sc = SparkContext(appName=class_name, conf=conf)
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/context.py", 
line 118, in __init__
conf, jsc, profiler_cls)
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/context.py", 
line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/context.py", 
line 270, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py",
 line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py",
 line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.OutOfMemoryError: Java heap space
```


---




[GitHub] spark issue #19254: [MINOR][CORE] Cleanup dead code and duplication in Mem. ...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19254
  
**[Test build #3922 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3922/testReport)**
 for PR 19254 at commit 
[`84094a0`](https://github.com/apache/spark/commit/84094a0c75727158f96a942d8d80cc3f5721b955).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19255
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19255
  
**[Test build #81846 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81846/testReport)**
 for PR 19255 at commit 
[`c903860`](https://github.com/apache/spark/commit/c903860ee8d25afda0f969b582bdbdaa0aa8c9fe).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19255
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81846/
Test FAILed.


---




[GitHub] spark issue #19255: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19255
  
**[Test build #81846 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81846/testReport)**
 for PR 19255 at commit 
[`c903860`](https://github.com/apache/spark/commit/c903860ee8d25afda0f969b582bdbdaa0aa8c9fe).


---




[GitHub] spark pull request #19255: [WIP][SPARK-22029] Add lru_cache to _parse_dataty...

2017-09-16 Thread maver1ck
Github user maver1ck commented on a diff in the pull request:

https://github.com/apache/spark/pull/19255#discussion_r139292791
  
--- Diff: python/pyspark/sql/types.py ---
@@ -24,6 +24,7 @@
 import re
 import base64
 from array import array
+from functools import lru_cache
--- End diff --

Any ideas for Python 2.7?


---




[GitHub] spark pull request #19255: [WIP][SPARK-22029] Add lru_cache to _parse_dataty...

2017-09-16 Thread maver1ck
GitHub user maver1ck opened a pull request:

https://github.com/apache/spark/pull/19255

[WIP][SPARK-22029] Add lru_cache to _parse_datatype_json_string

## What changes were proposed in this pull request?

_parse_datatype_json_string is called many times for the same datatypes.
By caching its result we can speed up PySpark internals.
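A sketch of the technique; the function body below is a simplified stand-in, not PySpark's actual parser:

```python
import json
from functools import lru_cache  # Python 3 only; Python 2 is discussed elsewhere in this thread

@lru_cache(maxsize=None)
def parse_datatype_json_string(json_string):
    # Repeated calls with the same schema string return the cached result
    # instead of re-parsing the JSON every time.
    return json.loads(json_string)
```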

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maver1ck/spark spark_22029

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19255.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19255


commit c903860ee8d25afda0f969b582bdbdaa0aa8c9fe
Author: Maciej Bryński 
Date:   2017-09-16T18:51:49Z

Add lru_cache to _parse_datatype_json_string




---




[GitHub] spark pull request #19246: [SPARK-22025] Speeding up fromInternal for Struct...

2017-09-16 Thread maver1ck
Github user maver1ck commented on a diff in the pull request:

https://github.com/apache/spark/pull/19246#discussion_r139291502
  
--- Diff: python/pyspark/sql/types.py ---
@@ -410,6 +410,24 @@ def __init__(self, name, dataType, nullable=True, 
metadata=None):
 self.dataType = dataType
 self.nullable = nullable
 self.metadata = metadata or {}
+self.needConversion = dataType.needConversion
+self.toInternal = dataType.toInternal
+self.fromInternal = dataType.fromInternal
+
+def __getstate__(self):
--- End diff --

We need to handle pickling ourselves because we have fields whose values are functions (bound methods).
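A self-contained sketch of the pattern, with toy classes standing in for PySpark's types; bound-method attributes are not picklable on Python 2, so they are dropped before pickling and rebuilt afterwards:

```python
import pickle

class ToyType(object):
    def fromInternal(self, v):
        return v

class ToyField(object):
    """Toy stand-in for StructField: caches a bound method for speed."""
    def __init__(self, dataType):
        self.dataType = dataType
        self.fromInternal = dataType.fromInternal  # bound method

    def __getstate__(self):
        # Drop the cached callable before pickling.
        state = self.__dict__.copy()
        del state["fromInternal"]
        return state

    def __setstate__(self, state):
        # Rebuild it when unpickling.
        self.__dict__.update(state)
        self.fromInternal = self.dataType.fromInternal

f = pickle.loads(pickle.dumps(ToyField(ToyType())))
assert f.fromInternal(42) == 42
```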


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread maver1ck
Github user maver1ck commented on the issue:

https://github.com/apache/spark/pull/19249
  
I added benchmark for this code


---




[GitHub] spark issue #19254: [MINOR][CORE] Cleanup dead code and duplication in Mem. ...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19254
  
**[Test build #3922 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3922/testReport)**
 for PR 19254 at commit 
[`84094a0`](https://github.com/apache/spark/commit/84094a0c75727158f96a942d8d80cc3f5721b955).


---




[GitHub] spark issue #19234: [SPARK-22010] Change fromInternal method of TimestampTyp...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19234
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81845/
Test PASSed.


---




[GitHub] spark issue #19234: [SPARK-22010] Change fromInternal method of TimestampTyp...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19234
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19234: [SPARK-22010] Change fromInternal method of TimestampTyp...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19234
  
**[Test build #81845 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81845/testReport)**
 for PR 19234 at commit 
[`bed6193`](https://github.com/apache/spark/commit/bed6193ad1d57a0f7873d1a7dccd6257e15a7dab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18323: [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18323
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81843/
Test PASSed.


---




[GitHub] spark issue #18323: [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18323
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18323: [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18323
  
**[Test build #81843 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81843/testReport)**
 for PR 18323 at commit 
[`2e2b2ca`](https://github.com/apache/spark/commit/2e2b2ca39ffb595ec5c26bcec71afa9df8a612c6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19234: [SPARK-22010] Change fromInternal method of TimestampTyp...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19234
  
**[Test build #81845 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81845/testReport)**
 for PR 19234 at commit 
[`bed6193`](https://github.com/apache/spark/commit/bed6193ad1d57a0f7873d1a7dccd6257e15a7dab).


---




[GitHub] spark pull request #19234: [SPARK-22010] Change fromInternal method of Times...

2017-09-16 Thread maver1ck
Github user maver1ck commented on a diff in the pull request:

https://github.com/apache/spark/pull/19234#discussion_r139290042
  
--- Diff: python/pyspark/sql/types.py ---
@@ -196,7 +199,9 @@ def toInternal(self, dt):
     def fromInternal(self, ts):
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                            else time.localtime(ts // 1000000))
+            return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
--- End diff --

I added some description and support for leap seconds
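For reference, a standalone sketch of the conversion in the diff above; the timestamp value is arbitrary, and it takes the local-time branch (i.e. assumes `_is_utc` is false):

```python
import time
import datetime

ts = 1505606077123456  # microseconds since the epoch

# Patched path: build the datetime from a time.struct_time.
y, m, d, hh, mm, ss = time.localtime(ts // 1000000)[:6]
patched = datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

# Original path, for comparison; the two normally agree.
original = datetime.datetime.fromtimestamp(
    ts // 1000000).replace(microsecond=ts % 1000000)

print(patched, patched == original)
```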


---




[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18659#discussion_r139289721
  
--- Diff: python/pyspark/serializers.py ---
@@ -199,6 +211,33 @@ def __repr__(self):
 return "ArrowSerializer"
 
 
+class ArrowPandasSerializer(ArrowSerializer):
+
+def __init__(self):
+super(ArrowPandasSerializer, self).__init__()
+
+def dumps(self, series):
+"""
+Make an ArrowRecordBatch from a Pandas Series and serialize
+"""
+import pyarrow as pa
--- End diff --

I am okay with leaving it as is here. I think we should catch and throw it 
with better messages in all cases later but let's talk about this in another 
place.


---




[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18659#discussion_r139289612
  
--- Diff: python/pyspark/serializers.py ---
@@ -199,6 +211,33 @@ def __repr__(self):
 return "ArrowSerializer"
 
 
+class ArrowPandasSerializer(ArrowSerializer):
+
+def __init__(self):
+super(ArrowPandasSerializer, self).__init__()
+
+def dumps(self, series):
+"""
+Make an ArrowRecordBatch from a Pandas Series and serialize
+"""
+import pyarrow as pa
--- End diff --

Ah, hm .. let me check the previous discussions and think about this a bit 
more.


---




[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDF...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18659#discussion_r139289505
  
--- Diff: python/pyspark/serializers.py ---
@@ -199,6 +211,33 @@ def __repr__(self):
 return "ArrowSerializer"
 
 
+class ArrowPandasSerializer(ArrowSerializer):
+
+def __init__(self):
+super(ArrowPandasSerializer, self).__init__()
+
+def dumps(self, series):
+"""
+Make an ArrowRecordBatch from a Pandas Series and serialize
+"""
+import pyarrow as pa
--- End diff --

Ah, I see. Thanks for explaining it. Sure, I am okay with leaving it as is here.
I think we should catch and rethrow it with better messages in all cases (probably in the entry points), but let's talk about that later in another place.
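The kind of wrapping being suggested, as a sketch; the helper name and message text are illustrative only:

```python
def require_pyarrow():
    # Import lazily and fail with an actionable message instead of a bare
    # ImportError from deep inside the serializer.
    try:
        import pyarrow as pa
    except ImportError:
        raise ImportError("pyarrow is required for Arrow-based serialization; "
                          "install it, e.g. with `pip install pyarrow`.")
    return pa
```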


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19253
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81844/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19253
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19253
  
**[Test build #81844 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81844/testReport)**
 for PR 19253 at commit 
[`525d654`](https://github.com/apache/spark/commit/525d6546b06173feda4c5806176ebb98c825dcaa).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19252
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81842/
Test PASSed.


---




[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19252
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19252
  
**[Test build #81842 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81842/testReport)**
 for PR 19252 at commit 
[`ba963b4`](https://github.com/apache/spark/commit/ba963b46cd2917315bc2bd0cf237c7d9f79e9d65).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19254: [MINOR][CORE] Cleanup dead code and duplication in Mem. ...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19254
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19230
  
LGTM too.


---




[GitHub] spark pull request #19254: [MINOR][CORE] Cleanup dead code and duplication i...

2017-09-16 Thread original-brownbear
GitHub user original-brownbear opened a pull request:

https://github.com/apache/spark/pull/19254

[MINOR][CORE] Cleanup dead code and duplication in Mem. Management

## What changes were proposed in this pull request?

* Removed the method 
`org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter#alignToWords`.
It became unused as a result of 85b0a157543201895557d66306b38b3ca52f2151
(SPARK-15962) introducing word alignment for unsafe arrays.
* Cleaned up duplicate code in memory management and unsafe sorters
  * The change extracting the exception paths is more than just cosmetic, since it definitely reduces the size that the affected methods compile to

## How was this patch tested?

* The build still passes after removing the method, and grepping the codebase for `alignToWords` shows no remaining references.
* The deduplicated code is covered by existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/original-brownbear/spark cleanup-mem-consumer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19254.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19254


commit 84094a0c75727158f96a942d8d80cc3f5721b955
Author: Armin 
Date:   2017-09-16T13:34:23Z

[MINOR][CORE] Cleanup dead code and duplication in mem. management

* Removed the method 
`org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter#alignToWords`.
It became unused as a result of 85b0a157543201895557d66306b38b3ca52f2151
(SPARK-15962) introducing word alignment for unsafe arrays.
* Cleaned up duplicate code in memory management and unsafe sorters
  * The change extracting the exception paths is more than just cosmetics 
since it def. reduces the size the affected methods compile to

* Build still passes after removing the method, grepping the codebase for 
`alignToWords` shows no reference to it anywhere either.
* Dried up code is covered by existing tests.




---




[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19230
  
LGTM except some minor comments


---




[GitHub] spark pull request #19179: [TRIVIAL][SQL] Cleanup Todo for removal of org.ap...

2017-09-16 Thread original-brownbear
Github user original-brownbear closed the pull request at:

https://github.com/apache/spark/pull/19179


---




[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139287957
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala
 ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach {
+
+  var testVector: WritableColumnVector = _
+
+  private def allocate(capacity: Int, dt: DataType): WritableColumnVector 
= {
+new OnHeapColumnVector(capacity, dt)
+  }
+
+  override def afterEach(): Unit = {
+testVector.close()
+  }
+
+  test("boolean") {
+testVector = allocate(10, BooleanType)
+(0 until 10).foreach { i =>
+  testVector.appendBoolean(i % 2 == 0)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, BooleanType) === (i % 2 == 0))
+}
+  }
+
+  test("byte") {
+testVector = allocate(10, ByteType)
+(0 until 10).foreach { i =>
+  testVector.appendByte(i.toByte)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ByteType) === (i.toByte))
+}
+  }
+
+  test("short") {
+testVector = allocate(10, ShortType)
+(0 until 10).foreach { i =>
+  testVector.appendShort(i.toShort)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ShortType) === (i.toShort))
+}
+  }
+
+  test("int") {
+testVector = allocate(10, IntegerType)
+(0 until 10).foreach { i =>
+  testVector.appendInt(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, IntegerType) === i)
+}
+  }
+
+  test("long") {
+testVector = allocate(10, LongType)
+(0 until 10).foreach { i =>
+  testVector.appendLong(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, LongType) === i)
+}
+  }
+
+  test("float") {
+testVector = allocate(10, FloatType)
+(0 until 10).foreach { i =>
+  testVector.appendFloat(i.toFloat)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, FloatType) === i.toFloat)
+}
+  }
+
+  test("double") {
+testVector = allocate(10, DoubleType)
+(0 until 10).foreach { i =>
+  testVector.appendDouble(i.toDouble)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, DoubleType) === i.toDouble)
+}
+  }
+
+  test("string") {
+testVector = allocate(10, StringType)
+(0 until 10).map { i =>
+  val utf8 = s"str$i".getBytes("utf8")
+  testVector.appendByteArray(utf8, 0, utf8.length)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, StringType) === UTF8String.fromString(s"str$i"))
+}
+  }
+
+  test("binary") {
+testVector = allocate(10, BinaryType)
+(0 until 10).map { i =>
+  val utf8 = s"str$i".getBytes("utf8")
+  testVector.appendByteArray(utf8, 0, utf8.length)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(

[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139287815
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
 ---
@@ -158,7 +158,7 @@ private static void appendValue(WritableColumnVector 
dst, DataType t, Object o)
 dst.getChildColumn(0).appendInt(c.months);
 dst.getChildColumn(1).appendLong(c.microseconds);
   } else if (t instanceof DateType) {
-dst.appendInt(DateTimeUtils.fromJavaDate((Date)o));
+dst.appendInt((int) DateTimeUtils.fromJavaDate((Date)o));
--- End diff --

is it necessary?


---




[GitHub] spark issue #19253: [SPARK-22037][SQL] Collapse Project if it is the child o...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19253
  
**[Test build #81844 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81844/testReport)**
 for PR 19253 at commit 
[`525d654`](https://github.com/apache/spark/commit/525d6546b06173feda4c5806176ebb98c825dcaa).


---




[GitHub] spark pull request #19253: [SPARK-22037][SQL] Collapse Project if it is the ...

2017-09-16 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/19253

[SPARK-22037][SQL] Collapse Project if it is the child of Aggregate

## What changes were proposed in this pull request?

If an Aggregate's child is a Project, collapse the Project into the Aggregate.
This rule makes query plans simpler.

## How was this patch tested?

Unit test

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark 
CollapseProjectIntoAggregate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19253


commit 525d6546b06173feda4c5806176ebb98c825dcaa
Author: Wang Gengliang 
Date:   2017-09-16T13:45:10Z

Collapse Project if it is the child of Aggregate




---




[GitHub] spark issue #18323: [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18323
  
**[Test build #81843 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81843/testReport)**
 for PR 18323 at commit 
[`2e2b2ca`](https://github.com/apache/spark/commit/2e2b2ca39ffb595ec5c26bcec71afa9df8a612c6).


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81841/
Test PASSed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #81841 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81841/testReport)**
 for PR 19218 at commit 
[`c7ff62c`](https://github.com/apache/spark/commit/c7ff62cc1d4e35f7ffd2f711068a2ed5dc47e406).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19249#discussion_r139286221
  
--- Diff: python/pyspark/sql/types.py ---
@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
--- End diff --

Let's describe this in more details and add some numbers (and in your other 
PRs too).


---




[GitHub] spark pull request #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19249#discussion_r139286195
  
--- Diff: python/pyspark/sql/types.py ---
@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
--- End diff --

Ah, I see. This can be recursive and per-record and we avoid here by 
pre-computing. I see. That makes much sense.
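
For readers following along, a standalone sketch of that pre-computation (illustrative only; `needConversion()` is the existing per-type check in `pyspark.sql.types`):

```python
from pyspark.sql.types import LongType, StructField, StructType, TimestampType

schema = StructType([
    StructField("id", LongType()),       # needs no conversion
    StructField("ts", TimestampType()),  # needs fromInternal
])

# Computed once per schema rather than once per field per record:
needs_conversion = [f.needConversion() for f in schema.fields]

row = (1, 1505558400000000)  # internal form: microseconds since the epoch
values = [f.fromInternal(v) if n else v
          for f, v, n in zip(schema.fields, row, needs_conversion)]
```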


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19249
  
Okay, let's go ahead then. Let's add some numbers in the PR description.


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread maver1ck
Github user maver1ck commented on the issue:

https://github.com/apache/spark/pull/19249
  
Yep.


---




[GitHub] spark issue #18685: [SPARK-21439] Support for ABCMeta in PySpark

2017-09-16 Thread maver1ck
Github user maver1ck commented on the issue:

https://github.com/apache/spark/pull/18685
  
Ping received.
I'll try to add tests and resolve the conflict.


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19249
  
Did it save 6~7% of the total execution time?


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread maver1ck
Github user maver1ck commented on the issue:

https://github.com/apache/spark/pull/19249
  
I was checking this with my production code.
This gives me about a 6-7% speed-up and removes 408 million function calls :)

I'll try to create a benchmark for this.
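
A minimal micro-benchmark sketch along those lines (illustrative schema, not the production workload):

```python
import timeit

from pyspark.sql.types import LongType, StructField, StructType, TimestampType

schema = StructType([StructField("id", LongType()),
                     StructField("ts", TimestampType())])
row = (1, 1505558400000000)  # internal form: microseconds since the epoch

# One million StructType.fromInternal calls; run before and after the
# patch and compare the timings.
print(timeit.timeit(lambda: schema.fromInternal(row), number=1000000))
```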


---




[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19252
  
**[Test build #81842 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81842/testReport)**
 for PR 19252 at commit 
[`ba963b4`](https://github.com/apache/spark/commit/ba963b46cd2917315bc2bd0cf237c7d9f79e9d65).


---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19249
  
To be honest, it looks so trivial that I wouldn't bother.


---




[GitHub] spark pull request #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats ...

2017-09-16 Thread aokolnychyi
GitHub user aokolnychyi opened a pull request:

https://github.com/apache/spark/pull/19252

[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable

## What changes were proposed in this pull request?

Tables in the catalog cache are not invalidated once their statistics are 
updated. As a consequence, existing sessions will use the cached information 
even though it is not valid anymore. Consider the example below.

```
// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)
```

After step 3, the table will be present in the catalog relation cache. Step 
4 will correctly update the metadata inside the catalog but will NOT invalidate 
the cache.

By the way, ``spark.sql("analyze table tab1 compute statistics")`` between 
step 3 and step 4 would also solve the problem.
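
Until the fix lands, a session-level workaround sketch (assuming the same `spark` session) is to evict the cached relation manually between step 4 and step 5:

```python
# Drops tab1 from the relation cache so the next plan re-reads the
# statistics that step 4 correctly wrote to the catalog.
spark.catalog.refreshTable("tab1")
```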

## How was this patch tested?

Current and additional unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aokolnychyi/spark spark-21969

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19252


commit ba963b46cd2917315bc2bd0cf237c7d9f79e9d65
Author: aokolnychyi 
Date:   2017-09-16T11:57:52Z

[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable




---




[GitHub] spark issue #19249: [SPARK-22032] Speed up StructType.fromInternal

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19249
  
How many function calls does it save, and how much does it improve things? Otherwise I'd not bother fixing this.


---




[GitHub] spark pull request #19234: [SPARK-22010] Change fromInternal method of Times...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19234#discussion_r139285021
  
--- Diff: python/pyspark/sql/types.py ---
@@ -196,7 +199,9 @@ def toInternal(self, dt):
     def fromInternal(self, ts):
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                            else time.localtime(ts // 1000000))
+            return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
--- End diff --

BTW, the change itself seems fine to me.


---




[GitHub] spark pull request #19234: [SPARK-22010] Change fromInternal method of Times...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19234#discussion_r139284915
  
--- Diff: python/pyspark/sql/types.py ---
@@ -196,7 +199,9 @@ def toInternal(self, dt):
     def fromInternal(self, ts):
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                            else time.localtime(ts // 1000000))
+            return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
--- End diff --

Could we have more details in the PR description? I can follow what it fixes, but I believe we should describe the details better so that others can follow the history in the future. To me, it has always been hard to track such changes and pin down the root cause of a bug.


---




[GitHub] spark issue #19246: [SPARK-22025] Speeding up fromInternal for StructField

2017-09-16 Thread maver1ck
Github user maver1ck commented on the issue:

https://github.com/apache/spark/pull/19246
  
@dongjoon-hyun 
I'll do it on Monday.


---




[GitHub] spark pull request #19234: [SPARK-22010] Change fromInternal method of Times...

2017-09-16 Thread maver1ck
Github user maver1ck commented on a diff in the pull request:

https://github.com/apache/spark/pull/19234#discussion_r139284824
  
--- Diff: python/pyspark/sql/types.py ---
@@ -196,7 +199,9 @@ def toInternal(self, dt):
     def fromInternal(self, ts):
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                            else time.localtime(ts // 1000000))
+            return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
--- End diff --

I think the only difference is this `ss = min(ss, 59)`
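
For context, a condensed sketch of what CPython's `fromtimestamp` does for naive local time (see the CPython source link elsewhere in this thread); the clamp below is that one difference:

```python
import datetime
import time

def fromtimestamp_naive(t):
    # Roughly datetime.datetime.fromtimestamp(t), ignoring fractional
    # seconds and time zone handling:
    y, m, d, hh, mm, ss, _, _, _ = time.localtime(t)
    ss = min(ss, 59)  # clamp a leap second some platforms report as 60
    return datetime.datetime(y, m, d, hh, mm, ss)
```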


---




[GitHub] spark pull request #19234: [SPARK-22010] Change fromInternal method of Times...

2017-09-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19234#discussion_r139284793
  
--- Diff: python/pyspark/sql/types.py ---
@@ -196,7 +199,9 @@ def toInternal(self, dt):
     def fromInternal(self, ts):
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                            else time.localtime(ts // 1000000))
+            return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
--- End diff --

I'd explain how the current change is equivalent to `fromtimestamp` in the 
PR description.


https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun 

A problem has been encountered. There are two ways to specify the compression format:
1. CREATE TABLE Test(id int) STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');
2. set orc.compress=ZLIB;
If the table was already given a compression format when it was created, and another compression format is then specified by setting 'orc.compress', the latter takes effect.

So maybe the Spark side should not have a default value, so that we can distinguish the 'undefined' case; or we could discard this change and explain in the documentation that 'spark.sql.parquet.compression.codec' does not take effect for partitioned tables and that 'spark.sql.orc.compression.codec' is not valid for Hive tables. Or perhaps you have another, better solution.
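
A quick way to observe that precedence (sketch only; assumes a Hive-enabled `spark` session, and relies on the behavior described above):

```python
spark.sql("CREATE TABLE Test(id int) STORED AS ORC "
          "TBLPROPERTIES ('orc.compress'='SNAPPY')")
spark.sql("SET orc.compress=ZLIB")
# Per the behavior described above, the session-level setting, not the
# table property, decides the codec for the write below.
spark.sql("INSERT INTO Test VALUES (1)")
```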


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #81841 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81841/testReport)**
 for PR 19218 at commit 
[`c7ff62c`](https://github.com/apache/spark/commit/c7ff62cc1d4e35f7ffd2f711068a2ed5dc47e406).


---




[GitHub] spark issue #19251: [SPARK-22035][SQL]the value of statistical logicalPlan.s...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19251
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19251: [SPARK-22035][SQL]the value of statistical logica...

2017-09-16 Thread heary-cao
GitHub user heary-cao opened a pull request:

https://github.com/apache/spark/pull/19251

[SPARK-22035][SQL]the value of statistical logicalPlan.stats.sizeInBytes 
which is not expected

## What changes were proposed in this pull request?

Currently, `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode` assumes a unary node produces the same number of rows as its child, and estimates `sizeInBytes` by scaling the child's size by the ratio of the output row size to the child row size (each row size includes an 8-byte overhead). The problem is that this estimate can be inaccurate. For example:
 ```
val N = 1 << 3
val df = spark.range(N).selectExpr("id as k1",
  s"cast(id  % 3 as string) as idString1",
  s"cast((id + 1) % 5 as string) as idString3")
val sizeInBytes = df.logicalPlan.stats.sizeInBytes
println("sizeInBytes : " + sizeInBytes)
```
Before the change:
sizeInBytes is 224 (8 * 8 * (8 + 20 + 20 + 8) / (8 + 8)).
Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 16 outputRowSize: 56
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 224
sizeInBytes: 224
```

After the change:
sizeInBytes is 384 (8 * 8 * (8 + 20 + 20) / 8).
Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 8 outputRowSize: 48
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 384
sizeInBytes: 384
```

This PR fixes it.

## How was this patch tested?

Added new test cases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heary-cao/spark sizeInBytes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19251


commit 99d48e5045775853ed7f487c94522e43f98a867a
Author: caoxuewen 
Date:   2017-09-16T10:11:46Z

improved statistics logicalPlan.stats.sizeInBytes




---




[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-09-16 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/18936
  
BTW I do think this is a promising idea. I'd welcome more info about the 
performance implications, but if it seems like a net win for most users we 
should do it.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #81840 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81840/testReport)**
 for PR 19218 at commit 
[`732266c`](https://github.com/apache/spark/commit/732266cc77eb370442dce0f125a70d634b3ebc6d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81840/
Test FAILed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #81840 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81840/testReport)**
 for PR 19218 at commit 
[`732266c`](https://github.com/apache/spark/commit/732266cc77eb370442dce0f125a70d634b3ebc6d).


---




[GitHub] spark pull request #19248: [SPARK-22027] Add missing explanation of default ...

2017-09-16 Thread exKAZUu
Github user exKAZUu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19248#discussion_r139282831
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
@@ -44,7 +44,7 @@ private[ml] trait HasRegParam extends Params {
 private[ml] trait HasMaxIter extends Params {
 
   /**
-   * Param for maximum number of iterations (>= 0).
+   * Param for maximum number of iterations (>= 0). (default = 20)
--- End diff --

I see. Sorry for my misunderstanding. I will close this.


---



