Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23063
Let me leave a cc for @holdenk, @MLnick, @jkbradley and @mengxr FYI.
---
-
To unsubscribe, e-mail: reviews-unsubscr
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
@zsxwing sure. Sorry that I rushed. Will do next time.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23052
I think now would be a good time to match the behaviours.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23056
Merged to master.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23056
@BryanCutler, let me merge this. Let's do the ML one and then clean up the comments throughout both ML and MLlib at once.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23052
Another related try: https://github.com/apache/spark/pull/13252
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23052
One earlier try to add some tests for reading/writing empty dataframes was here:
https://github.com/apache/spark/pull/13253, FYI.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23052
Which should be ... this https://github.com/apache/spark/pull/12855
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23054
retest this please
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23056
retest this please
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23056
retest this please
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22309
adding @liancheng BTW. IIRC, he took a look at this one before and
abandoned the change (correct me if I'm misremembering).
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23055#discussion_r234086569
--- Diff:
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23055#discussion_r234081475
--- Diff:
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23056
retest this please
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23056#discussion_r234080468
--- Diff: python/pyspark/mllib/tests/test_linalg.py ---
@@ -0,0 +1,642 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23056#discussion_r234080249
--- Diff: python/pyspark/testing/mllibutils.py ---
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23046#discussion_r234073703
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23055
cc @rdblue, @vanzin and @haydenjeune
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/23055
[SPARK-26080][SQL] Disable 'spark.executor.pyspark.memory' always on Windows
## What changes were proposed in this pull request?
The `resource` package is a Unix-specific package. See
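The Unix-only nature of `resource` mentioned in the PR description can be illustrated with a small platform guard. This is a hedged sketch only — the helper name `has_resource_module` is hypothetical and not from the PR:

```python
def has_resource_module():
    """Return True if the Unix-only `resource` stdlib module is importable.

    Hypothetical helper for illustration: on Windows this import raises
    ImportError, which is why a resource-based memory limit cannot apply there.
    """
    try:
        import resource  # noqa: F401  -- Unix-only stdlib module
        return True
    except ImportError:
        return False
```

A config like 'spark.executor.pyspark.memory' that relies on `resource` would be gated behind such a check.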
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
Thank you @BryanCutler.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23046#discussion_r234063905
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23052#discussion_r234062564
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
---
@@ -174,13 +174,18 @@ private[csv] class
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23052
@MaxGekk, actually this is kind of an important behaviour change. It
basically means we're unable to read the empty files back. Similar changes were
proposed in Parquet a few years ago (by me
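The concern above — that an empty file carries nothing to read back — can be seen with plain CSV and no Spark at all. A sketch (not the PR's code): writing zero rows produces a truly empty file, so a reader cannot recover how many columns the writer had.

```python
import csv
import io

# Writing zero rows (no header) produces an empty file...
out = io.StringIO()
csv.writer(out).writerows([])  # a zero-row "dataframe"

# ...and reading it back yields no rows at all: the column structure
# (the schema) is gone, which is the behaviour change being discussed.
rows = list(csv.reader(io.StringIO(out.getvalue())))
```

This is why skipping the write of empty files makes the round trip lossy.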
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23047
Merged to branch-2.4.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
Also, @BryanCutler, I think we can talk about the locations of
`testing/...util.py` later, when we finish splitting the tests. Moving the utils
then would probably cause fewer conflicts and should be good
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
Merged to master.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
@BryanCutler, this should be ready; we can work on ML and MLlib as well.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23034#discussion_r233829511
--- Diff: python/pyspark/testing/streamingutils.py ---
@@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
Will go and merge this tomorrow if there are no outstanding issues.
cc @zsxwing and @tdas.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
Merged to master.
Thanks @felixcheung.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
retest this please
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
Merged to master.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
I am merging this for the same reason as #23021. Let me know if there's
any concern, even after this gets merged
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
@BryanCutler, it looks like we should add `pyspark.ml.tests` at
https://github.com/apache/spark/blob/master/python/run-tests.py#L252-L253 so
that we can run unit tests first, before doc tests (because
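The ordering idea — schedule unit-test modules ahead of doctest modules so unit-test failures surface first — can be sketched like this. Module names here are illustrative; this is not the actual `run-tests.py` code:

```python
# Put unittest modules at the front of the schedule so they run before
# the doctest modules. The module names below are examples only.
unittest_modules = ["pyspark.ml.tests", "pyspark.mllib.tests"]
doctest_modules = ["pyspark.ml.classification", "pyspark.mllib.stat"]

schedule = unittest_modules + doctest_modules

# Unit tests come first in the schedule:
assert schedule.index("pyspark.ml.tests") < schedule.index("pyspark.ml.classification")
```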
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20788#discussion_r233678942
--- Diff: python/pyspark/sql/tests/test_dataframe.py ---
@@ -375,6 +375,19 @@ def test_generic_hints(self):
plan = df1.join(df2.hint
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20788
Thanks, @DylanGuedes.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23014
Merged to master.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23039
@MaxGekk, I think the main purpose of this PR is rather to introduce
`spark.sql.debug.maxToStringFields`. Let's fix the PR description and title
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
Yup, will address the other comments and update the PR accordingly.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
retest this please
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
Yup will do.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23034
I haven't tested the kinesis logic yet. I will check it via Jenkins.
Line counts:
```
751 ./test_dstream.py
 89 ./test_kinesis.py
158
```
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/23034
[WIP][SPARK-26035][PYTHON] Break large streaming/tests.py files into
smaller files
## What changes were proposed in this pull request?
This PR continues to break down a single large file
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21914
Please ask that on the mailing list.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
Rough line distributions look like this:
```
237 ./test_serializers.py
739 ./test_rdd.py
499 ./test_readwrite.py
 69 ./test_join.py
161
```
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23033
cc'ing @BryanCutler and @squito.
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/23033
[SPARK-26036][PYTHON] Break large tests.py files into smaller files
## What changes were proposed in this pull request?
This PR continues to break down a single large file into smaller
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
Merged to master.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
I am merging this in. Maybe I am rushing it, but please allow me to go
ahead since it's going to block other PySpark PRs.
In the worst case, I am willing to revert and propose this again
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
Ah, right, makes sense to me. Thanks @shaneknapp. +1
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
adding @holdenk, @ueshin and @icexelloss as well.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
adding @icexelloss as well.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
> Did you test on python3 as well?
Of course!
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22954#discussion_r233292436
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL,
samplingRatio = 1.0,
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
@shaneknapp, do you roughly know how difficult it is (and do you have some
time shortly) to upgrade R from 3.1 to 3.4? I am asking this because I had some
difficulties when I tried to manually
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23012#discussion_r233290797
--- Diff: docs/index.md ---
@@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g.
Linux, Mac OS). It's easy
locally on one
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
> Could you add some descriptions to run a single test file or a single
test case if exists?
D
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
Yup!
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22994
I haven't taken a super close look, but the idea itself looks okay. Is it
urgent? If yes, yup, I don't object to going ahead right away. Otherwise, it
might be good to leave this open for a few days
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
I am going to push after testing and double checking. The line counts would
look like this
```
 54 ./test_utils.py
199 ./test_catalog.py
503
```
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
> I'd break the pandas udf one into smaller pieces too, as you suggested.
We should also investigate why the runtime didn't improve ...
One suspicion from my investigation
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23021#discussion_r233269827
--- Diff: python/pyspark/testing/sqlutils.py ---
@@ -0,0 +1,268 @@
+#
--- End diff --
Yea, similar thought. One thing is though testing
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
Yup, will break the pandas one into smaller ones as well.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22962
Also, please fix the test. It doesn't really look clear; I actually
didn't quite like the test as it's written now
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22962#discussion_r233131375
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
@classmethod
def _getOrCreate(cls
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22962#discussion_r233130494
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
@classmethod
def _getOrCreate(cls
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22962#discussion_r233130221
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
@classmethod
def _getOrCreate(cls
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
The elapsed time looks virtually the same, and all tests look to be running
fine. The last commit should show the skipped tests properly as well. This
should be ready for a look
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r233012392
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
---
@@ -237,6 +237,13 @@ case class
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22979#discussion_r233009506
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
---
@@ -104,6 +106,14 @@ class UnivocityParser
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21588
To all: how about we restart the fix @wangyum tried before? If we
generally agree on the direction itself, upgrading Hive to 2.3 (or 3), I
would like to encourage him to continue #20659
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21588
The test failure itself doesn't look like it was caused by this change; the
tests will fail anyway, with a different error message.
If the goal is really just to check whether the tests pass or not, you
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22962#discussion_r232991503
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
@classmethod
def _getOrCreate(cls
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22962
The main code change LGTM too, in any event.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22962#discussion_r232990319
--- Diff: python/pyspark/tests.py ---
@@ -618,10 +618,13 @@ def test_barrier_with_python_worker_reuse(self):
"""
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
For your information, here's the line counts for each file:
```
 52 ./test_utils.py
197 ./test_catalog.py
 43 ./test_group.py
318 ./test_session.py
```
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
FWIW, I at least double checked whether there are any tests missing, and
whether they are actually being run (via coverage
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
adding @rxin (derived from mailing list)
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23021
@BryanCutler and @squito, here is the official first attempt to break
`pyspark/sql/tests.py` into multiple small files.
If there are no outstanding issues (for instance, if we
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/23021
[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
## What changes were proposed in this pull request?
This is the official first attempt to break the huge single
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23020
retest this please
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22954#discussion_r232895848
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
createDataFrame <- function(data, schema
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23006
Merged to master.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23014#discussion_r232893546
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
---
@@ -101,10 +101,11 @@ private void
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23014
> The reason is that each bucket file is too big
Can you elaborate, please? Is it because we don't chunk each file into
multiple splits when we read bucketed tables
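The chunking being asked about can be sketched abstractly: dividing a file of `size` bytes into read splits of at most `max_split` bytes. The helper `compute_splits` is hypothetical and is not Spark's actual split logic:

```python
def compute_splits(size, max_split):
    """Return (offset, length) pairs covering `size` bytes in chunks of
    at most `max_split` bytes -- the kind of chunking a non-bucketed scan
    performs, which bucketed reads skip (one file maps to one task)."""
    return [(off, min(max_split, size - off))
            for off in range(0, size, max_split)]

# A 10-byte file with 4-byte splits becomes three read tasks:
# compute_splits(10, 4) -> [(0, 4), (4, 4), (8, 2)]
```

Without such splitting, one oversized bucket file cannot be parallelized across tasks, which would explain the "too big" complaint.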
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23014#discussion_r232885260
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
---
@@ -101,7 +101,8 @@ private void
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23018
Looks fine to me. Adding @cloud-fan and @hvanhovell.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23018#discussion_r232883084
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
---
@@ -469,7 +471,21 @@ abstract class TreeNode[BaseType
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
In this way, we could postpone the R upgrade in Jenkins until after the
Spark 3.0.0 release, and could still test the deprecated R version 3.1
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
Nice, thanks! BTW Felix, are you maybe worried that if we happen to
upgrade the R version in Jenkins to 3.4, we could break support for lower,
deprecated R versions in Spark 3.0, I guess?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/zeppelin/pull/3206
Thank you all!!
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
BTW, let's test them end-to-end. For instance,
`spark.range(1).rdd.map(lambda blabla).count()`
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
If the perf diff is big, let's not change it, but document that we can use
`CloudPickleSerializer()` to avoid a breaking change.
If the perf diff is rather trivial, let's check if we can
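One lightweight way to obtain the perf diff being discussed is to time the serialization step directly. This sketch uses the stdlib `pickle` as a stand-in baseline; it is not PySpark's serializer code, and a candidate like cloudpickle would be timed the same way:

```python
import pickle
import timeit

def time_serializer(dumps, payload, number=200):
    """Seconds taken to serialize `payload` `number` times with `dumps`."""
    return timeit.timeit(lambda: dumps(payload), number=number)

# A payload with a mix of container and string data, purely illustrative.
payload = {"ints": list(range(1000)), "text": "x" * 1000}

baseline = time_serializer(pickle.dumps, payload)
# A candidate serializer (e.g. cloudpickle.dumps) would be timed the same
# way and its result compared against `baseline`.
```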
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
Nope, it should be done manually. It would be great to have it, FWIW.
I am not yet sure how we're going to measure the performance; I think you
can show the performance diff
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23011
Merged to master.
Thanks, @srowen.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23012#discussion_r232742917
--- Diff: R/pkg/R/sparkR.R ---
@@ -283,6 +283,10 @@ sparkR.session <- function(
enableHiveSupport = TRUE,
...) {
+ if (ut
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23012
Yea, I will take a look and address that. But about documenting it as
unsupported: if we are explicitly going to say it's unsupported and dropped,
for instance, we should remove the compatibility change
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22429
Oops, I rushed reading. Yea, but it still sounds related yet orthogonal. Let's
move it to the mailing list; that should be the best place to discuss further.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22429
@boy-uber, for structured streaming, let's do it outside of this PR. I think
the actual change of this PR can be small (1.). We can change this API for
structured streaming later if needed since