Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/23008
Interestingly, `cloudpickle` adds overhead even if the namedtuple is
importable:
```bash
$ cat a.py
from collections import namedtuple
A = namedtuple("A", [&
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/23008
Is there a benchmark suite for PySpark?
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Closing this PR to continue the discussion in the new one.
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/21157
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@HyukjinKwon done in #23008.
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/23008
[SPARK-22674][PYTHON] Removed the namedtuple pickling patch
## What changes were proposed in this pull request?
Prior to this PR PySpark patched ``collections.namedtuple`` to make
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> I think people do define NamedTuples in Notebooks, so I'm going to stick
with -1.
@holdenk I understand your point, but there is still something we can do
without breaking exist
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@HyukjinKwon do you mean changing the default serializer to cloudpickle and
removing `_hack_namedtuple`?
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
`cloudpickle` does indeed support pickling namedtuples. Maybe the way to go
is to remove the patch and advertise the `cloudpickle` serializer for projects
relying on the old behaviour. Wdyt @holdenk
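For projects that want to keep defining namedtuples anywhere, opting in is a one-liner; a sketch, assuming the `CloudPickleSerializer` shipped in `pyspark.serializers`:
```python
# Construct the SparkContext with PySpark's bundled cloudpickle-based
# serializer instead of the default PickleSerializer, so classes defined
# in __main__ or a REPL are pickled by value.
from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

sc = SparkContext(appName="namedtuple-compat",
                  serializer=CloudPickleSerializer())
```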
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> If removing the hack entirely is going to break named tuples defined in
the REPL I'm a -1 on that change.
Yes, but it might be OK for two reasons: people rarely define namedtup
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, it will break IPython notebooks as well. I wonder how often people
actually define namedtuples in a notebook?
Emitting a warning is a less extreme option, yes
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, that is correct. That is why I think the hijacking behaviour should be
removed. It silently slows down the job and does not notify the user that a
trivial change such as making the namedtuple
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Nope, the job I was referring to is not open source; but I guess the
speedup is easy to justify: much less payload and faster deserialization:
```
>>> from collectio
```
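The numbers are cut off above, but the shape of the argument can be sketched; a simplified model of the hijacked `__reduce__` (patterned on `_hack_namedtuple`, details elided) next to plain pickling:
```python
# Each hijacked instance reduces to (class name, field names, values),
# so every pickled record carries the class recipe; plain pickle of an
# importable namedtuple stores only a class reference plus the values.
import pickle
from collections import namedtuple

A = namedtuple("A", ["x", "y"])

def _restore(name, fields, values):
    # rebuilds the class on the fly, as the hack's restore helper does
    return namedtuple(name, fields)(*values)

plain = pickle.dumps(A(1, 2))
A.__reduce__ = lambda self: (_restore, ("A", self._fields, tuple(self)))
hijacked = pickle.dumps(A(1, 2))
print(len(plain), len(hijacked))  # the hijacked payload is noticeably larger
```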
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> Is it possible to keep the current hack for things which can't be
pickled, but remove the hack in the situation where the namedtuple is well
behaved and it could be pickled direc
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@rxin the ones mentioned as 1 and 2 in the PR description:
https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
https://superbobry.github.io/tensorflowonspark
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/21157#discussion_r221229392
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -94,12 +100,9 @@ def train(cls, data, minSupport=0.3, numPartitions=-1):
model = callMLlibFunc
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> so this change would introduce a pretty big regression?
The change does introduce a regression as some namedtuples will become
unpicklable. However, it makes pickling in PySpark m
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> So, the real downside of removing this now is that we disallow global-scope
namedtuples.
Importable namedtuples and their subclasses could still be used inside an
RDD. Only the namedtup
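To make the first point concrete, a sketch (given an existing SparkContext `sc`; `mymodule.Record` is a hypothetical namedtuple importable on both the driver and the executors):
```python
# Classes that pickle can serialise by reference keep working inside RDD
# operations once the patch is removed; `mymodule.Record` is hypothetical.
from mymodule import Record

rdd = sc.parallelize([Record(1), Record(2)])
print(rdd.map(lambda r: r[0]).collect())  # [1, 2]
```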
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/21180
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
@HyukjinKwon I've reopened the PR doing this and rebased it on the latest
master, see #21157.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Reopened and rebased to be merged into the 3.X branch. See discussion in
#21180.
GitHub user superbobry reopened a pull request:
https://github.com/apache/spark/pull/21157
[SPARK-22674][PYTHON] Removed the namedtuple pickling patch
## What changes were proposed in this pull request?
This is a breaking change.
Prior to this commit PySpark
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
@HyukjinKwon sure, should I just open a new PR against master or some other
branch?
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
> Yea the next will likely be 3.0.0 and actually many fixes are being
targeted to 3.0.0 (even one of mine).
Awesome, I'm OK to wait until 3.0.0 then. Is there a release date alre
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
The only workaround I know is to manually remove the `__reduce__` method for
each namedtuple class in use. This has to be done on both the driver and the
executors.
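A sketch of that workaround; `MyTuple` is a placeholder for each class in use, and the same snippet has to run inside the executor processes as well:
```python
# Undo the patched __reduce__ so pickling falls back to the normal
# by-reference behaviour. MyTuple stands in for a namedtuple class that
# PySpark's import-time patch has hijacked.
from collections import namedtuple

MyTuple = namedtuple("MyTuple", ["x"])  # placeholder

try:
    del MyTuple.__reduce__  # drop the hijacked method if present
except AttributeError:
    pass  # class was never patched
```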
> The fix itself lo
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
@HyukjinKwon could you reference other PySpark maintainers who could do a
review and share their thoughts
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
@HyukjinKwon do you think asking for feedback on the dev mailing list might
help decide whether it is OK to merge it into 2.X?
I want to point out that the PR
* does not break any
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Thanks for getting back and no worries about the delay.
> [...] can we target 3.0.0 for this? I don't have enough nerve to push
this in to fix this hack within 2.4.0 since the c
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Sorry to bug you @HyukjinKwon, but I would really like this patch to
make it into the next PySpark release. Would you have time in the coming
weeks to have another look
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Hey @HyukjinKwon, apologies for yet another ping. Should I ask for someone
else to review the PR on the spark-dev mailing list
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Friendly ping @HyukjinKwon.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
I agree that, in theory, the use of `_getframe` is somewhat limiting, but I
doubt it would be a major concern in practice. Python does not have a standard,
so the best we can hope for, as far
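For context, CPython's own `collections.namedtuple` relies on the same mechanism to fill in `__module__`; a sketch of the kind of frame inspection under discussion (the helper name is mine):
```python
# CPython's namedtuple uses sys._getframe(1).f_globals.get('__name__')
# to set __module__, guarding against interpreters without _getframe;
# a heuristic built on the same call is no less portable than the stdlib.
import sys

def _caller_module(depth=2):
    """Best-effort name of the module whose code called our caller."""
    try:
        return sys._getframe(depth).f_globals.get("__name__", "__main__")
    except (AttributeError, ValueError):  # no _getframe or stack too shallow
        return "__main__"
```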
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
> I think you know the current approach doesn't look safe, right?
Could you elaborate on that, ple
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
@HyukjinKwon, @felixcheung did you have the time to look into this? If you
disagree with the proposed implementation, I think we should at least document
the current behaviour somewhere
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Hey @HyukjinKwon and @felixcheung, do you think the PR is good to be merged
as-is, or would you like me to think further about how to make it more robust
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21180
Hi @HyukjinKwon, could you kindly have a look at the new version? It is
backwards compatible and has the same robustness guarantees as
`collections.namedtuple` from CPython
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/21180#discussion_r184889165
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
for k, v
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/21180#discussion_r184860643
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
for k, v
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/21180
[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples
Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples,
regardless of whether they were defined on the top
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Closing in favour of #21180.
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/21157
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
One improvement we can make is to change the patch to bypass namedtuples
which are importable. This would resolve the issues with namedtuples coming
from third-party libraries. I can open a new PR
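An illustrative test such a bypass could use (`cloudpickle` performs a similar importability check); note the in-process round-trip below is only an approximation, since classes defined in `__main__` also pass it:
```python
# A class that plain pickle can serialise by reference needs no hack.
import pickle

def _is_importable(cls):
    try:
        # pickling a class stores only its module and qualified name; a
        # successful round-trip means it is resolvable by reference
        return pickle.loads(pickle.dumps(cls)) is cls
    except Exception:
        return False
```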
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, we can backport some of the cloudpickle code to make the patch less
fragile. This would be a nontrivial change to already complex code, but I'd
be happy to sketch this if there's
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> Does the test even pass?
The tests should pass modulo the tests specifically checking the behaviour
being removed. I think the failing RDD test is in this group as w
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/21157
[SPARK-22674][PYTHON] Removed the namedtuple pickling patch
## What changes were proposed in this pull request?
This is a breaking change.
Prior to this commit PySpark patched
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/19992
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19992
@squito I think it's fine to just close the PR/JIRA issue.
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r159153806
--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r158701191
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -444,12 +444,15 @@ private[spark] object JsonProtocol {
("
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r158689120
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -988,12 +991,20 @@ private[spark] object JsonProtocol {
rddInfo
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r158688956
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -444,12 +444,15 @@ private[spark] object JsonProtocol {
("
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r158154921
--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r158028287
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -444,12 +444,15 @@ private[spark] object JsonProtocol {
("
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19992
Can someone have a look at the tests, please? I can't see the failure (and
in theory, the change should not affect SparkR
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19992
Minor update: I've simulated #18162 on one of our 80G event logs and
(unless there is a bug in the filtering code) the log shrank to 157M. The
effect of this patch was almost negligible
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r157340550
--- Diff: core/src/main/scala/org/apache/spark/storage/StorageLevel.scala
---
@@ -132,6 +132,23 @@ class StorageLevel private(
override def
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r157311572
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -988,12 +997,16 @@ private[spark] object JsonProtocol {
rddInfo
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r157311547
--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19992#discussion_r157257388
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -444,11 +444,20 @@ private[spark] object JsonProtocol {
("
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/19992
[SPARK-22805][CORE] Use StorageLevel aliases in event logs
The format of event logs uses redundant representation for storage
levels, for instance StorageLevel.DISK_ONLY is represented
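The example is cut off, but the redundancy has roughly this shape; the field names below follow `JsonProtocol`'s JSON output (assumed from the Spark version under discussion), and the alias string is the proposed compact form:
```python
# What every RDDInfo repeats in the event log for StorageLevel.DISK_ONLY,
# versus the single alias the PR proposes to write instead.
expanded = {
    "Use Disk": True,
    "Use Memory": False,
    "Deserialized": False,
    "Replication": 1,
}
alias = "DISK_ONLY"
```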
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/17230
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19458
@jerryshao the error looks a bit unrelated.
```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: null
```
The suite passes locally for me
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19458#discussion_r146491977
--- Diff:
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,17 @@ private[spark] class DiskBlockManager(conf
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19458#discussion_r146213702
--- Diff:
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19458#discussion_r144655974
--- Diff:
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19458
On second thought, the documentation of `DiskBlockManager.getAllFiles`
states that it returns all files in the directories managed by
`DiskBlockManager`. Filtering anything would violate
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19458
> Instead of filtering out temp blocks, why not add a parsing rule for
`TempLocalBlockId` and `TempShuffleBlockId`?
Because that would only fix the issue for 2 out of 3 possible temp fi
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19458#discussion_r143586763
--- Diff:
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +102,9 @@ private[spark] class DiskBlockManager(conf
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/19458
[SPARK-7][CORE] DiskBlockManager.getAllBlocks now tolerates temp files
## What changes were proposed in this pull request?
Prior to this commit getAllBlocks implicitly assumed
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
I've fixed the failing `DiskStoreSuite` and ensured the other two suites
also pass fine.
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19369#discussion_r142392349
--- Diff:
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -85,11 +65,9 @@ object BlockReplicationUtils
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19369#discussion_r141672528
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -67,7 +67,7 @@ private[spark] class DiskStore(
var threwException
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19369#discussion_r141671388
--- Diff:
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -85,11 +65,9 @@ object BlockReplicationUtils
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/19369#discussion_r141670392
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -49,7 +49,7 @@ private[spark] class DiskStore(
private val
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
I've looked at the allocation profile of the sample; it does indeed contain
strings from other sources, so the heap histogram is not representative.
However, the allocations from
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
Will have a look at the failing tests tomorrow and also investigate the
origins of the strings in the sample above
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
> Yes, if you have evidence this is a hotspot, then this does look like a
valid fix.
I don't think it's a hotspot (otherwise it would have probably been
reported long before). I
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/19369
[SPARK-22147][CORE] Removed redundant allocations from BlockId
## What changes were proposed in this pull request?
Prior to this commit BlockId.hashCode and BlockId.equals were defined
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17746
Thank you.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17746
@srowen thanks! I'd missed the point that 0.13.1 was intentionally merged
into the upcoming release.
@dbtsai could you give an example of the breaking API change between 0.12
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17746
Hello, are there any plans to backport this into the 2.1 branch? The LBFGS
and other fixes in 0.13.1 seem important enough.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17837
Sorry, opened by accident.
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/17837
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/17837
Sync with upstream 2.1
## What changes were proposed in this pull request?
This is a backport of the upstream `branch-2.1`.
## How was this patch tested?
Test upstream
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/16705
Is there a chance to see this in 2.1.1?
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/17598#discussion_r110950456
--- Diff: core/src/main/scala/org/apache/spark/serializer/Serializer.scala
---
@@ -125,7 +125,7 @@ abstract class SerializerInstance {
* A stream
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/17598
[SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable
## What changes were proposed in this pull request?
This PR allows using `SerializationStream
Github user superbobry commented on a diff in the pull request:
https://github.com/apache/spark/pull/17230#discussion_r108351585
--- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
@@ -198,17 +183,114 @@ private[spark] class PipedRDD[T: ClassTag](
val
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17230
Okay, one more benchmark: 4 partitions of the [Consumer Complaint
Database](https://catalog.data.gov/dataset/consumer-complaint-database), a more
realistic workload than the sonnets
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17230
Here are the results of piping [The Sonnets by W.
Shakespeare](https://www.gutenberg.org/cache/epub/1105/pg1105.txt) via `cat`
using either `PipedRDD` or a hand-patched `PipedRDD` with I/O formats
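For reference, the baseline side of such a benchmark is plain `RDD.pipe`; a sketch (given an existing SparkContext `sc` and a local copy of the text):
```python
# Pipe each partition's lines through `cat` and count the output lines;
# the file path is illustrative.
rdd = sc.textFile("pg1105.txt", minPartitions=4)
print(rdd.pipe("cat").count())
```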
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/17230
@holdenk thank you again for the feedback. I want to clarify the point on
performance regressions:
> From looking at the description change it would be important to verify
that PySp
Github user superbobry closed the pull request at:
https://github.com/apache/spark/pull/16805
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/16805
Hi @holdenk, apologies, there must've been a rebase quirk. The only change
relevant to this PR is 6d8e7c3e70da04cd8ec3a5bb63b671130df46088. I've opened a
new clean PR #17230, because this one
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/17230
[SPARK-19353][CORE] Generalize PipedRDD to use I/O formats
## What changes were proposed in this pull request?
This patch allows using arbitrary input and output formats when
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/16805
[SPARK-19353][CORE] Generalize PipedRDD to use I/O formats
## What changes were proposed in this pull request?
This commit allows using arbitrary input and output formats when
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/16416
[SPARK-19010][CORE] Include Kryo exception in case of overflow
## What changes were proposed in this pull request?
This is to work around an implicit result of #4947 which suppressed