[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
My impression so far has been that we note things in the migration guide when 
they are improvements (not bugs), non-trivial, and related to backward 
compatibility.

Shall we clarify what to document in the migration guide? Otherwise we would 
have to document everything related to, say, external changes, deprecation 
removals, and many other changes related to column names.


---




[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
BTW, it's closer to a bug than an improvement, though. `from_json` should 
have the default name `from_json` rather than `jsontostructs` - end users would 
have no idea why it's called `jsontostructs`.
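
As a minimal sketch of the naming issue (PySpark; `spark` is an assumed 
SparkSession, and the before/after column names follow this PR's description):

```python
from pyspark.sql.functions import from_json
from pyspark.sql.types import IntegerType, StructField, StructType

schema = StructType([StructField("a", IntegerType())])
df = spark.createDataFrame([('{"a": 1}',)], ['json'])

# The generated column name comes from the expression's prettyName;
# without one, it falls back to the class name "jsontostructs".
print(df.select(from_json('json', schema)).columns)
# before this PR: ['jsontostructs(json)'] -> after: ['from_json(json)']
```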


---




[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
That's the exact issue I raised before, and we ended up not keeping the 
compatibility in column names. cc @cloud-fan and @hvanhovell.


---




[GitHub] spark issue #22795: [SPARK-25798][PYTHON] Internally document type conversio...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22795
  
Thanks @viirya!!


---




[GitHub] spark pull request #22816: [SPARK-25822][PySpark]Fix a race condition when r...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22816#discussion_r228012237
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -114,7 +114,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 
 context.addTaskCompletionListener[Unit] { _ =>
   writerThread.shutdownOnTaskCompletion()
-  if (!reuseWorker || !released.get) {
+  if (!reuseWorker || released.compareAndSet(false, true)) {
--- End diff --

I think the `released` variable here indicates whether `releasePythonWorker` 
was actually called before. Shall we rename it and update the comments at `val 
released`? 


---




[GitHub] spark pull request #22816: [SPARK-25822][PySpark]Fix a race condition when r...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22816#discussion_r228012035
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -114,7 +114,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 
 context.addTaskCompletionListener[Unit] { _ =>
   writerThread.shutdownOnTaskCompletion()
-  if (!reuseWorker || !released.get) {
+  if (!reuseWorker || released.compareAndSet(false, true)) {
--- End diff --

I think the `released` variable here indicates whether `releasePythonWorker` 
was actually called. Do you mind elaborating on why `released` is set to `true` 
here as well?
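
A minimal sketch of the compare-and-set idiom in the diff above, in Python for 
illustration (the real code uses `java.util.concurrent.atomic.AtomicBoolean`; 
the class below is a hypothetical stand-in):

```python
import threading

class AtomicBoolean:
    """Hypothetical stand-in for java.util.concurrent.atomic.AtomicBoolean."""

    def __init__(self, value=False):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expect, update):
        # Atomically move from `expect` to `update`; return True only for
        # the single caller that actually performed the transition.
        with self._lock:
            if self._value == expect:
                self._value = update
                return True
            return False

released = AtomicBoolean(False)

# Even if two task-completion callbacks race, exactly one of them wins the
# False -> True transition, so the worker is released/destroyed only once.
if released.compare_and_set(False, True):
    print("this caller releases the worker")
else:
    print("the worker was already released")
```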


---




[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22237
  
Thanks all!!


---




[GitHub] spark issue #22795: [SPARK-25798][PYTHON] Internally document type conversio...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22795
  
Thanks, @BryanCutler.


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
Adding it as a known issue sounds reasonable to me as well.


---




[GitHub] spark pull request #22807: [WIP][SPARK-25811][PySpark] Raise a proper error ...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22807#discussion_r227784077
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4961,6 +4961,31 @@ def foofoo(x, y):
 ).collect
 )
 
+def test_pandas_udf_detect_unsafe_type_conversion(self):
+    from distutils.version import LooseVersion
+    from pyspark.sql.functions import pandas_udf
+    import pandas as pd
+    import numpy as np
+    import pyarrow as pa
+
+    values = [1.0] * 3
+    pdf = pd.DataFrame({'A': values})
+    df = self.spark.createDataFrame(pdf).repartition(1)
+
+    @pandas_udf(returnType="int")
+    def udf(column):
+        return pd.Series(np.linspace(0, 1, 3))
+
+    udf_boolean = df.select(['A']).withColumn('udf', udf('A'))
+
+    # Since 0.11.0, PyArrow supports the feature to raise an error for unsafe cast.
+    if LooseVersion(pa.__version__) >= LooseVersion("0.11.0"):
--- End diff --

BTW, let's bump up the minimal required PyArrow and Pandas versions if 
possible in 3.0 :-)
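
A minimal sketch of the PyArrow behavior the test relies on (assuming PyArrow 
>= 0.11 is installed): casting non-integral floats to an integer type with 
`safe=True` raises an error instead of silently truncating.

```python
import pyarrow as pa

arr = pa.array([0.0, 0.5, 1.0])

# safe=False silently truncates 0.5 -> 0; safe=True refuses the lossy cast.
print(arr.cast(pa.int32(), safe=False))
try:
    arr.cast(pa.int32(), safe=True)
except pa.ArrowInvalid as e:
    print("unsafe cast rejected:", e)
```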


---




[GitHub] spark issue #22730: [SPARK-16775][CORE] Remove deprecated accumulator v1 API...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22730
  
adding @cloud-fan since accumulator version 2 was added by you.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227634476
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     conversion on returned data. The conversion is not guaranteed to be correct and results
     should be checked for accuracy by users.
     """
+
+# The following table shows most of the Pandas data and SQL type conversions
+# in Pandas UDFs that are not yet visible to the user. Some of the behaviors
+# are buggy and might be changed in the near future. The table might have to
+# be eventually documented externally.
+# Please see SPARK-25798's PR for the code used to generate the table below.
+#
+# [SQL Type \ Pandas Value(Type) conversion table elided: it is garbled in
+# this archive; see SPARK-25798's PR for the generated table.]  # noqa

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22237
  
https://github.com/apache/spark/pull/22237/files#r223707899 makes sense to 
me. Addressed. LGTM from my side as well.


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
> According to the policy, we don't have to block the current release 
because of it

@cloud-fan, BTW, would you mind sharing what you read? I want to be aware of 
the policy as well. Do you mean the correctness and data loss issues mentioned 
at https://spark.apache.org/contributing.html? Those can be considered 
blockers?

Blocker means: pointless to release without this change as the release 
would be unusable to a large minority of users. Correctness and data loss 
issues should be considered Blockers.

It's difficult to say, but it at least looks like not so few users are 
affected.


---




[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22514
  
@cloud-fan, is this a performance regression that affects users who use 
Hive serde tables as well?


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
Wait wait .. do we only care about regressions as blockers for the last 
release (2.3)? I'm asking this because I really don't know. If so, I'm okay.


---




[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22237
  
Ah, yea, I have direct access to this branch. Let me just rebase/address 
the comment tomorrow.


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
For instance, this discussion thread was started almost a year ago: 
https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
It does try to fix a backward compatibility issue. It was found only now, but 
it is still true that we found a breaking issue to fix. Broken backward 
compatibility that potentially affects a set of users (it at least looks like 
it blocks an external Hive UDF library repo) sounds like a blocker to me.

Some Hive compatibility issues are reported late because, IMHO, they are a bit 
complicated to follow (it requires taking a look at both Hive and Spark at 
least).


---




[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22237
  
Oh wait, you left a sign-off. Let me rebase it by tomorrow - it wouldn't be 
a big job.


---




[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22237
  
If @gengliangwang finds some time to work on this, yea, please go ahead.


---




[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22047#discussion_r227367500
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/AnyAgg.scala
 ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.TypeUtils
+import org.apache.spark.sql.types._
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
--- End diff --

BTW, don't forget to add `since`.


---




[GitHub] spark issue #22803: [SPARK-25808][BUILD] Upgrade jsr305 version from 1.3.9 t...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22803
  
Yea, similar opinion. If it does not fix an actual problem, I wouldn't 
encourage fixing it either .. 


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
If we were going for 3.0, then I would definitely leave a +1, and I agree that 
we should rather focus on Spark itself as a higher priority - we should do that 
when we go to 3.0 and rather drop such cases, IMHO. However, it looks like we 
are going for 2.4 for now. 

It indeed looks like a mistake, and this brings back a feature that we had in 
previous releases; it also sounds like the problem is not isolated (let's say, 
it's not specific to only a few UDAFs), which, IMHO, looks like a proper 
blocker.

How about we bring this back, if the changes proposed here are all we need? If 
the current change is not enough, we could at least mention in the migration 
guide that this feature was dropped intentionally.

About testing, this PR filled in "How was this patch tested?" although it 
has no test. @pgandhi999, are you able to write a test?


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r227336895
  
--- Diff: bin/docker-image-tool.sh ---
@@ -79,7 +79,7 @@ function build {
   fi
 
   # Verify that Spark has actually been built/is a runnable distribution
-  # i.e. the Spark JARs that the Docker files will place into the image are present
--- End diff --

Therefore, using non-breakable spaces in the code is obviously not a good 
practice. Please avoid it next time.


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r227336318
  
--- Diff: bin/docker-image-tool.sh ---
@@ -79,7 +79,7 @@ function build {
   fi
 
   # Verify that Spark has actually been built/is a runnable distribution
-  # i.e. the Spark JARs that the Docker files will place into the image are present
--- End diff --

For the issue itself, it's related to a historical quirk of Python. Python 2's 
`str` type was a bytes-like string. That looked like a mistake that confused 
users about the distinction between bytes and strings, so Python 3 introduced 
`str` as a unicode string concept, like other programming languages.

`open(...).read()` reads `str` (which is bytes) in Python 2, but reads unicode 
strings in Python 3 - which needs an implicit conversion between bytes and 
strings. It looks like it had to be this way to minimise the breaking changes 
in user code.

So, a bytes-to-string conversion happens here, and unfortunately our 
Jenkins machines' system default encoding is set to ascii (even though arguably 
UTF-8 is more common).
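
As a minimal sketch of that failure mode (the byte string below is 
hypothetical; a non-breaking space U+00A0 encodes to 0xC2 0xA0 in UTF-8):

```python
# A script line containing a non-breaking space decodes fine as UTF-8,
# but not under an ascii default encoding - which is what broke Jenkins.
data = u'# i.e. the Spark JARs\u00a0are present\n'.encode('utf-8')

print(data.decode('utf-8'))  # works
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc2 ...
```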


For non-ascii itself, please see the justification at 
http://www.scalastyle.org/rules-dev.html in ScalaStyle.


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22775#discussion_r227251515
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -770,8 +776,17 @@ case class SchemaOfJson(
 factory
   }
 
-  override def convert(v: UTF8String): UTF8String = {
-    val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser =>
+  @transient
+  private lazy val json = child.eval().asInstanceOf[UTF8String]
--- End diff --

Yea, that was my thought, and yea, I don't strongly feel we should enforce it 
in the future either, but I was thinking we had better enforce it for 2.4.x (if 
the RC fails).

I was thinking about other possibilities to use this expression with 
`from_json` (for instance, let's say .. as an aggregation expression or .. 
automatically somehow collect a few examples from a column and return a 
schema). Actually, Max proposed an aggregation expression first and that got 
rejected by Reynold, and this way was suggested by him.

> It's also weird that users are willing to write verbose json literal in 
from_json(..., schema = schema_of_json(...)) instead of DDL string.

Yea, it is, but that's what the current code does 
(https://github.com/apache/spark/blob/dbf9cabfb80c157cf0e52014b07c4abeb1aa2ff6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L803).

Nothing serious, but I was just thinking about allowing only what we need for 
now, given there has been some discussion about it before.


---




[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22775#discussion_r227239707
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -770,8 +776,17 @@ case class SchemaOfJson(
 factory
   }
 
-  override def convert(v: UTF8String): UTF8String = {
-    val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser =>
+  @transient
+  private lazy val json = child.eval().asInstanceOf[UTF8String]
--- End diff --

Also, basically we expect a structured data format in the SQL path, and 
different DDL type strings for each record wouldn't be very useful anyway.


---




[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22775#discussion_r227239024
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -770,8 +776,17 @@ case class SchemaOfJson(
 factory
   }
 
-  override def convert(v: UTF8String): UTF8String = {
-    val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser =>
+  @transient
+  private lazy val json = child.eval().asInstanceOf[UTF8String]
--- End diff --

It is possible, and that's one case I am trying to disallow here: 
`from_json` only accepts `schema_of_json` when the input is a literal. I can't 
imagine a proper use case for DDL-formatted strings derived from a JSON column.

The main purpose of this expression was `from_json(..., schema = 
schema_of_json(...))`. I was trying to allow only what I need.
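
A minimal sketch of that intended usage (PySpark; assumes a SparkSession 
`spark` and a Spark version where `from_json` accepts a schema column):

```python
from pyspark.sql.functions import from_json, lit, schema_of_json

df = spark.createDataFrame([('{"a": 1, "b": [1, 2]}',)], ['json'])

# schema_of_json infers a DDL schema from a literal example record,
# which from_json then uses to parse the whole column.
inferred = schema_of_json(lit('{"a": 0, "b": [0]}'))
df.select(from_json('json', inferred)).show(truncate=False)
```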


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---




[GitHub] spark issue #22730: [SPARK-16775][CORE] Remove deprecated accumulator v1 API...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22730
  
retest this please


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
@cloud-fan, mind taking a look please?


---




[GitHub] spark issue #22787: [SPARK-25040][SQL] Empty string for non string types sho...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22787
  
Merged to master.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227199025
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     conversion on returned data. The conversion is not guaranteed to be correct and results
     should be checked for accuracy by users.
     """
+
+# The following table shows most of the Pandas data and SQL type conversions
+# in Pandas UDFs that are not yet visible to the user. Some of the behaviors
+# are buggy and might be changed in the near future. The table might have to
+# be eventually documented externally.
+# Please see SPARK-25798's PR for the code used to generate the table below.
+#
+# [SQL Type \ Pandas Value(Type) conversion table elided: it is garbled in
+# this archive; see SPARK-25798's PR for the generated table.]  # noqa

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227198050
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     conversion on returned data. The conversion is not guaranteed to be correct and results
     should be checked for accuracy by users.
     """
+
+# The following table shows most of the Pandas data and SQL type conversions
+# in Pandas UDFs that are not yet visible to the user. Some of the behaviors
+# are buggy and might be changed in the near future. The table might have to
+# be eventually documented externally.
+# Please see SPARK-25798's PR for the code used to generate the table below.
+#
+# [SQL Type \ Pandas Value(Type) conversion table elided: it is garbled in
+# this archive; see SPARK-25798's PR for the generated table.]  # noqa
--- End diff --

I think it's fine .. many conversions here look buggy; for instance, 
`A(category)` with `tinyint` becomes `0`, and the string conversions look off 
too .. Let's just fix them, upgrade Arrow, and then update this chart later..


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r227173103
  
--- Diff: bin/docker-image-tool.sh ---
@@ -79,7 +79,7 @@ function build {
   fi
 
   # Verify that Spark has actually been built/is a runnable distribution
-  # i.e. the Spark JARs that the Docker files will place into the image are present
--- End diff --

It was more because a non-breakable space was used instead of a regular space. 
It was valid UTF-8, but not ascii-compatible.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227026198
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     conversion on returned data. The conversion is not guaranteed to be correct and results
     should be checked for accuracy by users.
     """
+
+# The following table shows most of the Pandas data and SQL type conversions
+# in Pandas UDFs that are not yet visible to the user. Some of the behaviors
+# are buggy and might be changed in the near future. The table might have to
+# be eventually documented externally.
+# Please see SPARK-25798's PR for the code used to generate the table below.
+#
--- End diff --

It's not supposed to be exposed or visible to users yet, as I said in the 
comments. Let's keep it internal for now.


---




[GitHub] spark issue #22787: [SPARK-25040][SQL] Empty string for non string types sho...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22787
  
It's chaotic ... 


---




[GitHub] spark issue #22795: [SPARK-25798][PYTHON] Internally document type conversio...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22795
  
cc @viirya, @BryanCutler and @cloud-fan


---




[GitHub] spark issue #22655: [SPARK-25666][PYTHON] Internally document type conversio...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22655
  
Hey @viirya, I happened to find some time to work on it - I submitted a PR: 
https://github.com/apache/spark/pull/22795.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22795

[SPARK-25798][PYTHON] Internally document type conversion between Pandas 
data and SQL types in Pandas UDFs

## What changes were proposed in this pull request?

We are facing some problems with type conversions between Pandas data and 
SQL types in Pandas UDFs.
It's even difficult to identify the problems (see #20163 and #22610).

This PR targets to internally document the type conversion table. Some of 
the conversions look buggy and we should fix them.

The table can be generated via the code below:

```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float64', 'float16'),
    ('float64', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32),
                            np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
    return pdf

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(pandas_udf(
                lambda _: create_dataframe()[column], t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): %s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))


schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(
    lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]),
    zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "# %s  # noqa" % line, strings.strip().split("\n"))))
```

This code is compatible with both Python 2 and 3 but the table was 
generated under Python 2.

## How was this patch tested?

[GitHub] spark issue #22787: [SPARK-25040][SQL] Empty string for non string types sho...

2018-10-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22787
  
GitHub looks buggy for now. Let me clean up my comments if they got messed up.


---




[GitHub] spark issue #22787: [SPARK-25040][SQL] Empty string for non string types sho...

2018-10-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22787
  
Yea, it looks like we'd better fix the comments.

LGTM otherwise.


---




[GitHub] spark issue #22787: [SPARK-25040][SQL] Empty string for non string types sho...

2018-10-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22787
  
Yea, looks good, as we discussed. Should we maybe update the migration 
guide too while we are here?


---




[GitHub] spark pull request #22783: [WIP][BUILD] Fix errors of log4j when pip sanity ...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/22783


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---




[GitHub] spark pull request #22783: [WIP][BUILD] Fix errors of log4j when pip sanity ...

2018-10-20 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22783

[WIP][BUILD] Fix errors of log4j when pip sanity checking

## What changes were proposed in this pull request?

PIP sanity checking produces some errors about log4j. I have some 
suspicion about it - I will file a JIRA and fix this PR description when my 
doubt is cleared up.

```
Run basic sanity check on pip installed version with spark-submit
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: target/unit-tests.log (No such file or 
directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at 
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at 
org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
at 
org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
at 
org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:71)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:79)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Successfully ran pip sanity check
Run basic sanity check with import based
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: target/unit-tests.log (No such file or 
directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at 
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at 
org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
at 
org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
at 
org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:71)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:79)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).

[Stage 0:> (0 + 10) / 10]


Successfully ran pip sanity check

[GitHub] spark issue #22662: [SPARK-25627][TEST] Reduce test time for ContinuousStres...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22662
  
retest this please


---




[GitHub] spark issue #22776: [SPARK-25779][SQL][TESTS] Remove SQL query tests for fun...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22776
  
retest this please


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---




[GitHub] spark issue #22782: [HOTFIX] Fix PySpark pip packaging tests by non-ascii co...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22782
  
Merged to master.


---




[GitHub] spark issue #22782: [HOTFIX] Fix PySpark pip packaging tests by non-ascii co...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22782
  
The pip packaging tests passed. Let me merge this one since it blocks 
almost every PR.


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833951
  
--- Diff: bin/docker-image-tool.sh ---
@@ -79,7 +79,7 @@ function build {
   fi
 
   # Verify that Spark has actually been built/is a runnable distribution
-  # i.e. the Spark JARs that the Docker files will place into the image are present
--- End diff --

Yea, some text editors insert non-breakable spaces, and that's probably why. 
I dealt with similar problems in md files before. I think we have a scalastyle 
rule for non-ascii, but it looks like there is none for scripts ..


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833113
  
--- Diff: python/pyspark/__init__.py ---
@@ -16,7 +16,7 @@
 #
 
 """
-PySpark is the Python API for Spark.
+PySpark is the Python API for Spark
--- End diff --

yup. I will revert this one too. I just intended to test the Python tests only 
since the Scala tests take long.


---




[GitHub] spark pull request #22782: [HOTFIX] Fix PySpark pip packaging tests by non-a...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22782#discussion_r226833051
  
--- Diff: dev/run-tests.py ---
@@ -551,7 +551,8 @@ def main():
 if not changed_files or any(f.endswith(".scala")
 or f.endswith("scalastyle-config.xml")
 for f in changed_files):
-run_scala_style_checks()
+# run_scala_style_checks()
--- End diff --

tentative workaround. yup yup. I will revert it all back once the tests pass


---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22501
  
Yup, I made a fix https://github.com/apache/spark/pull/22782


---




[GitHub] spark issue #22748: [SPARK-25745][K8S] Improve docker-image-tool.sh script

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22748
  
@vanzin, the test failure was related. Don't merge if the tests have failed.


---




[GitHub] spark pull request #22782: [WIP][HOTFIX] PIP failure fix

2018-10-20 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22782

[WIP][HOTFIX] PIP failure fix

## What changes were proposed in this pull request?


## How was this patch tested?

Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark pip-failure-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22782.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22782


commit 25435dc588f081c83248a3d6c8a69b568c9cb59e
Author: hyukjinkwon 
Date:   2018-10-20T16:32:56Z

Fix

commit cd5cab6cb981f88456488fde4e177c80ada0a8a8
Author: hyukjinkwon 
Date:   2018-10-20T16:33:25Z

python test




---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22501
  
Thanks. It might rather be related to external factors.


---




[GitHub] spark issue #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use ...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22501
  
I guess it's related to pip packaging, though. 

```
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py", line 224, in 

'Programming Language :: Python :: Implementation :: PyPy']
  File 
"/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/site-packages/setuptools/__init__.py", 
line 140, in setup
return distutils.core.setup(**attrs)
  File "/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/distutils/core.py", line 
148, in setup
dist.run_commands()
  File "/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/distutils/dist.py", line 
955, in run_commands
self.run_command(cmd)
  File "/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/distutils/dist.py", line 
974, in run_command
cmd_obj.run()
  File 
"/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/site-packages/setuptools/command/develop.py",
 line 38, in run
self.install_for_development()
  File 
"/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/site-packages/setuptools/command/develop.py",
 line 154, in install_for_development
self.process_distribution(None, self.dist, not self.no_deps)
  File 
"/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/site-packages/setuptools/command/easy_install.py",
 line 729, in process_distribution
self.install_egg_scripts(dist)
  File 
"/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/site-packages/setuptools/command/develop.py",
 line 189, in install_egg_scripts
script_text = strm.read()
  File "/tmp/tmp.R2Y98bevgD/3.5/lib/python3.5/encodings/ascii.py", line 
26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 
2719: ordinal not in range(128)

```

It's from setup.py


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark issue #22776: [SPARK-25779][SQL][TESTS] Remove SQL query tests for fun...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22776
  
retest this please


---




[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21157
  
The workaround is to use CloudPickler, btw. Technically, there are many cases 
that the normal pickler does not support. This one specific case (namedtuple) 
was allowed by this weird hack.
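
A minimal sketch of the difference (plain Python; assumes the third-party 
`cloudpickle` package is available - PySpark vendors it as 
`pyspark.cloudpickle`):

```python
import pickle
import cloudpickle

def make_adder(n):
    # A closure: something the standard pickler cannot serialize,
    # similar in spirit to the dynamically defined namedtuple case.
    def add(x):
        return x + n
    return add

add_one = make_adder(1)

try:
    pickle.dumps(add_one)
except Exception as e:
    print("standard pickle failed:", e)

# cloudpickle serializes the function body and its captured state.
restored = pickle.loads(cloudpickle.dumps(add_one))
print(restored(41))  # 42
```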


---




[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
Nope, it will work for both 2.11.8 and 2.11.12. I manually checked. This 
change only uses methods that exist in both Scala 2.11.8 and 2.11.12.


---


[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
Ah.. let me rebase and sync the tests


---




[GitHub] spark pull request #22781: [MINOR][DOC] Fix the building document to describ...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22781#discussion_r226816661
  
--- Diff: docs/building-spark.md ---
@@ -12,7 +12,7 @@ redirect_from: "building-with-maven.html"
 ## Apache Maven
 
 The Maven-based build is the build of reference for Apache Spark.
-Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+.
+Building Spark using Maven requires Maven 3.3.9 or newer and Java 8.
--- End diff --

+1


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r226814727
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3886,6 +3886,31 @@ object functions {
 withExpr(new CsvToStructs(e.expr, schema.expr, options.asScala.toMap))
   }
 
+  /**
+   * Parses a column containing a CSV string and infers its schema.
+   *
+   * @param e a string column containing CSV data.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def schema_of_csv(e: Column): Column = withExpr(new SchemaOfCsv(e.expr))
+
+  /**
+   * Parses a column containing a CSV string and infers its schema using 
options.
+   *
+   * @param e a string column containing CSV data.
+   * @param options options to control how the CSV is parsed. Accepts the same
+   *                options as the CSV data source. See [[DataFrameReader#csv]].
+   * @return a column with string literal containing schema in DDL format.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def schema_of_csv(e: Column, options: java.util.Map[String, String]): 
Column = {
--- End diff --

`schema_of_json` also has only the Java-specific overload (I actually 
suggested minimising the exposed functions), since the Java-specific one can 
be used from the Scala side but the Scala-specific one can't be used from the 
Java side.
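
For illustration, a minimal sketch of calling the Java-specific overload from 
Scala (assuming the signatures in this diff; the object name and sample data 
here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, schema_of_csv}
import scala.collection.JavaConverters._

object SchemaOfCsvFromScala {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    // A Scala Map converted to java.util.Map works with the Java-specific overload.
    val options = Map("sep" -> ";").asJava
    spark.range(1)
      .select(schema_of_csv(lit("1;abc"), options).as("schema"))
      .show(false) // expected to print something like: struct<_c0:int,_c1:string>
    spark.stop()
  }
}
```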


---




[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
Thank you @viirya and @dongjoon-hyun.


---




[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
Merged to master.


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
Yup, will fix.


---




[GitHub] spark issue #22773: [MINOR][SQL] Add prettyNames for from_json, to_json, fro...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
Other JIRAs have different fix versions. Let me create a new JIRA then.


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
retest this please


---




[GitHub] spark pull request #22776: [SPARK-25779][SQL][TESTS] Remove SQL query tests ...

2018-10-19 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22776

[SPARK-25779][SQL][TESTS] Remove SQL query tests for function documentation 
by DESCRIBE FUNCTION at SQLQueryTestSuite


## What changes were proposed in this pull request?

Currently, there are some tests testing function descriptions:

```bash
$ grep -ir "describe function" sql/core/src/test/resources/sql-tests/inputs
sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe 
function to_json;
sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe 
function extended to_json;
sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe 
function from_json;
sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe 
function extended from_json;
```

There doesn't seem to be much point in testing these, since we're not going 
to test the documentation itself.
The `DESCRIBE FUNCTION` functionality itself is already tested here and there.
See the test failures in https://github.com/apache/spark/pull/18749 (where 
I added examples to function descriptions).

We had better remove those tests so that people don't add similar tests to 
the SQL tests.

## How was this patch tested?

Manual.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25779

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22776


commit 7f4cd6d00b3bb27f123a5be4a7f5d4979d513c35
Author: hyukjinkwon 
Date:   2018-10-19T09:28:27Z

Remove SQL query tests for function documentation by DESCRIBE FUNCTION at 
SQLQueryTestSuite




---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Should be ready for a look now. Would you mind taking a look, @cloud-fan and 
@gatorsmile?


---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22775
  
This should be targeted at 2.4; otherwise, we should describe the behaviour 
change in the migration note.


---




[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-19 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22775

[SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal 
only

## What changes were proposed in this pull request?

The main purpose of `schema_of_json` is its use in combination with 
`from_json` (to make up for the lack of schema inference), which takes its 
schema only as a literal; however, `schema_of_json` currently allows JSON 
input from non-literal expressions (e.g., a column).

This was mistakenly allowed - we don't need to take usages other than the 
main purpose into account for now.

This PR is a follow-up that only allows literals as `schema_of_json`'s JSON 
input. We can allow non-literal expressions later when it's needed or when 
there is a use case for it.
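
For illustration, a minimal sketch of the intended combination (assuming 
Spark 2.4+, where `from_json` also accepts its schema as a `Column`; the 
object name and sample JSON here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

object SchemaOfJsonLiteralOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a": 1, "b": "str"}""").toDF("value")
    // Allowed: the JSON sample is a string literal.
    df.select(from_json($"value", schema_of_json(lit("""{"a": 1, "b": "str"}""")))).show()
    // Not allowed after this change: schema_of_json($"value") - a non-literal
    // input is expected to fail analysis.
    spark.stop()
  }
}
```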

## How was this patch tested?

Unit tests were added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25447-followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22775


commit 2705aa8739ecfc1d2ea8aea40082734084f36514
Author: hyukjinkwon 
Date:   2018-10-19T06:40:58Z

Make schema_of_json's input json as literal only




---




[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22429
  
retest this please


---




[GitHub] spark issue #22773: [MINOR][SQL] Add prettyNames for from_json, to_json, fro...

2018-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22773
  
retest this please


---




[GitHub] spark issue #22761: [MINOR][DOC] Spacing items in migration guide for readab...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22761
  
Merged to master and branch-2.4.


---




[GitHub] spark pull request #22773: [MINOR][SQL] Add prettyNames for from_json, to_js...

2018-10-18 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22773

[MINOR][SQL] Add prettyNames for from_json, to_json, from_csv, and 
schema_of_json

## What changes were proposed in this pull request?

This PR adds `prettyNames` for `from_json`, `to_json`, `from_csv`, and 
`schema_of_json` so that appropriate names are used.

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-prettyNames

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22773.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22773


commit 3447e73989e39d0c052cf69e5a3e80d1ebb221dc
Author: hyukjinkwon 
Date:   2018-10-19T05:28:55Z

Add prettyNames for from_json, to_json, from_csv, and schema_of_json




---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536773
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -840,12 +840,19 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 
   test("create table in default db") {
-val catalog = spark.sessionState.catalog
-val tableIdent1 = TableIdentifier("tab1", None)
-createTable(catalog, tableIdent1)
-val expectedTableIdent = tableIdent1.copy(database = Some("default"))
-val expectedTable = generateTable(catalog, expectedTableIdent)
-checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+var tablePath: Option[URI] = None
--- End diff --

`var tablePath: URI = null`


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536735
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -207,6 +207,14 @@ class SessionCatalog(
   "you cannot create a database with this name.")
 }
 validateName(dbName)
+// SPARK-25464 fail if DB location exists and is not empty
+val dbPath = new Path(dbDefinition.locationUri)
+val fs = dbPath.getFileSystem(hadoopConf)
+if (!externalCatalog.databaseExists(dbName) && fs.exists(dbPath)
+  && fs.listStatus(dbPath).nonEmpty) {
--- End diff --

Do we really need to list the files? It's potentially expensive.
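
For reference, a sketch of one lazier alternative (an assumption about 
Hadoop's `FileSystem` API, not what this PR does; `isNonEmptyDir` is a 
made-up helper):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// listLocatedStatus returns a RemoteIterator; on HDFS it fetches entries in
// batches, so asking only for the first entry avoids materializing the full
// listing. (The generic FileSystem default may still fall back to listStatus.)
def isNonEmptyDir(fs: FileSystem, path: Path): Boolean =
  fs.exists(path) && fs.listLocatedStatus(path).hasNext
```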


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536585
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -2370,4 +2370,17 @@ class HiveDDLSuite
   ))
 }
   }
+
+  test("SPARK-25464 create Database with non empty location") {
+val dbName = "dbwithcustomlocation"
+withTempDir { tmpDir =>
+  val parentDir = tmpDir.getParent
+  val expectedMsg = s"Cannot create database at location $parentDir 
because the path is not " +
+s"empty."
--- End diff --

leading `s` can be removed.


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536555
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -2370,4 +2370,17 @@ class HiveDDLSuite
   ))
 }
   }
+
+  test("SPARK-25464 create Database with non empty location") {
--- End diff --

`create a database with a non-empty location`


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536442
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -840,12 +840,19 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 
   test("create table in default db") {
-val catalog = spark.sessionState.catalog
-val tableIdent1 = TableIdentifier("tab1", None)
-createTable(catalog, tableIdent1)
-val expectedTableIdent = tableIdent1.copy(database = Some("default"))
-val expectedTable = generateTable(catalog, expectedTableIdent)
-checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+var tablePath: Option[URI] = None
+try {
+  val catalog = spark.sessionState.catalog
+  val tableIdent1 = TableIdentifier("tab1", None)
+  createTable(catalog, tableIdent1)
+  val expectedTableIdent = tableIdent1.copy(database = Some("default"))
+  val expectedTable = generateTable(catalog, expectedTableIdent)
+  tablePath = Some(expectedTable.location)
+  checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+} finally {
+  // This is external table,so it is required to deleted the path
--- End diff --

`this is an external table`


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536456
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -840,12 +840,19 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 
   test("create table in default db") {
-val catalog = spark.sessionState.catalog
-val tableIdent1 = TableIdentifier("tab1", None)
-createTable(catalog, tableIdent1)
-val expectedTableIdent = tableIdent1.copy(database = Some("default"))
-val expectedTable = generateTable(catalog, expectedTableIdent)
-checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+var tablePath: Option[URI] = None
+try {
+  val catalog = spark.sessionState.catalog
+  val tableIdent1 = TableIdentifier("tab1", None)
+  createTable(catalog, tableIdent1)
+  val expectedTableIdent = tableIdent1.copy(database = Some("default"))
+  val expectedTable = generateTable(catalog, expectedTableIdent)
+  tablePath = Some(expectedTable.location)
+  checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+} finally {
+  // This is external table,so it is required to deleted the path
--- End diff --

`it is required to delete`


---




[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22466#discussion_r226536304
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -840,12 +840,19 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 
   test("create table in default db") {
-val catalog = spark.sessionState.catalog
-val tableIdent1 = TableIdentifier("tab1", None)
-createTable(catalog, tableIdent1)
-val expectedTableIdent = tableIdent1.copy(database = Some("default"))
-val expectedTable = generateTable(catalog, expectedTableIdent)
-checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+var tablePath: Option[URI] = None
+try {
+  val catalog = spark.sessionState.catalog
+  val tableIdent1 = TableIdentifier("tab1", None)
+  createTable(catalog, tableIdent1)
+  val expectedTableIdent = tableIdent1.copy(database = Some("default"))
+  val expectedTable = generateTable(catalog, expectedTableIdent)
+  tablePath = Some(expectedTable.location)
+  checkCatalogTables(expectedTable, 
catalog.getTableMetadata(tableIdent1))
+} finally {
+  // This is external table,so it is required to deleted the path
--- End diff --

tiny nit: `table,so` -> `table, so` (add a space after the comma)


---




[GitHub] spark issue #22466: [SPARK-25464][SQL] Create Database to the location,only ...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22466
  
retest this please


---




[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22772
  
It's okay. The doc fix was huge, and there are likely some mistakes. I will 
read it closely too this weekend.


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
This is a WIP.


---




[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22429
  
retest this please


---




[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22429
  
I am able to address his comments during his vacation. Please keep reviewing 
this.


---




[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22503
  
Merged to master.


---




[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r226524439
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -220,6 +221,17 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("crlf line separators in multiline mode") {
--- End diff --

nit: -> `SPARK-25493: crlf line separators in multiline mode`

When a PR fixes a specific problem, let's add the JIRA prefix to the test 
name next time.


---




[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22576
  
retest this please


---



