[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/23008 Interestingly, `cloudpickle` adds overhead even if the namedtuple is importable: ```bash $ cat a.py from collections import namedtuple A = namedtuple("A", [&

[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/23008 Is there a benchmark suite for PySpark? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Closing this PR to continue the discussion in the new one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-11-11 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/21157 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @HyukjinKwon done in #23008. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #23008: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-11-11 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/23008 [SPARK-22674][PYTHON] Removed the namedtuple pickling patch ## What changes were proposed in this pull request? Prior to this PR PySpark patched ``collections.namedtuple`` to make

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > I think people do defined NamedTuples in Notebooks, so I'm going to stick with -1. @holdenk I understand your point, but there is still something we can do without breaking exist

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @HyukjinKwon do you mean change the default serializer to cloudpickle and remove _hack_namedtuple? --- - To unsubscribe, e

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-21 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 `cloudpickle` does indeed support pickling namedtuples. Maybe the way to go is to remove the patch advertise `cloudpickle` serializer for projects relying on the old behaviour. Wdyt @holdenk

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-20 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > If removing the hack entirely is going to brake named tuples defined in the repl I'm a -1 on that change. Yes, but it might be OK for two reasons: people rarely define namedtup

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-13 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, it will break IPython notebooks as well. I wonder how often people actually defined namedtuples in a notebook? Emitting a warning is a less extreme option, yes

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, that is correct. That is why I think hijacking behaviour should be removed. It silently slows down the job and does not notify the user that a trivial change such as making the namedtuple

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Nope, the job I was referring to is not open source; but I guess the speedup is easy to justify: much less payload and faster deserialization: ``` >>> from collectio

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-04 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > Is it possible to keep the current hack for things which can't be pickled, but remove the hack in the situation where the namedtuple is well behaved and it could be pickled direc

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-01 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @rxin the ones mentioned as 1 and 2 in the PR description: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html https://superbobry.github.io/tensorflowonspark

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-09-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/21157#discussion_r221229392 --- Diff: python/pyspark/mllib/fpm.py --- @@ -94,12 +100,9 @@ def train(cls, data, minSupport=0.3, numPartitions=-1): model = callMLlibFunc

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-28 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > so this change would introduce a pretty big regression? The change does introduce a regression as some namedtuples will become unpicklable. However, it makes pickling in PySpark m

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > So, real downside of removing this now is we disallow global scope namedtuple. Importable namedtuples and their subclasses could still be used inside an RDD. Only the namedtup

[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-09-27 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/21180 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-09-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 @HyukjinKwon I've reopened the PR doing this and rebased it on the latest master, see #21157. --- - To unsubscribe, e-mail

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Reopened and rebased to be merged into the 3.X branch. See discussion in #21180. --- - To unsubscribe, e-mail: reviews

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-09-27 Thread superbobry
GitHub user superbobry reopened a pull request: https://github.com/apache/spark/pull/21157 [SPARK-22674][PYTHON] Removed the namedtuple pickling patch ## What changes were proposed in this pull request? This is a breaking change. Prior to this commit PySpark

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-09-09 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 @HyukjinKwon sure, should I just open a new PR against master or some other branch? --- - To unsubscribe, e-mail: reviews

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-20 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 > Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine). Awesome, I'm OK to wait until 3.0.0 then. Is there a release date alre

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-20 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 The only workaround I know is to manually remove the __reduce__ method for each namedtuple class in use. This has to be done on both the driver and the executors. > The fix itself lo

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-20 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 @HyukjinKwon could you reference other PySpark maintainers who could do a review and share their thoughts

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-14 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 @HyukjinKwon do you think asking for feedback on the dev mailing list might help decide, whether it is OK to merge it in 2.X? I want to point out that the PR * does not break any

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-08 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Thanks for getting back and no worries about the delay. > [...] can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the c

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-08-07 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Sorry to bug you @HyukjinKwon, but I would really like for this patch to make it into the next PySpark release. Would you have time in the following weeks to have another look

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-07-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Hey @HyukjinKwon, apologies for yet another ping. Should I ask for someone else to review the PR on the spark-dev mailing list

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-06-06 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Friendly ping @HyukjinKwon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-05-28 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 I agree that in theory, the use of `_getframe` is somewhat limiting, but I doubt it would be a major concern in practice. Python does not have a standard, so the best we can hope for, as far

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-05-28 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 > I think you know the current approach doesn't look safe, right? Could you elaborate on that, ple

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-05-28 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 @HyukjinKwon, @felixcheung did you have the time to look into this? If you disagree with the proposed implementation, I think we should at least document the current behaviour somewhere

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-05-14 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Hey @HyukjinKwon and @felixcheung, do you think the PR is good to be merged as-is, or would you like me to think further about how to make it more robust

[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-05-06 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Hi @HyukjinKwon, could you kindly have a look at the new version? It is backwards compatible and has the same robustness guarantees as `collections.namedtuple` from CPython

[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-29 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184889165 --- Diff: python/pyspark/serializers.py --- @@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs): for k, v

[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184860643 --- Diff: python/pyspark/serializers.py --- @@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs): for k, v

[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-27 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/21180 [SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples, regardless if they were defined on the top

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-04-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Closing in favour of #21180. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-04-27 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/21157 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-04-26 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 One improvement we can make is change the patch to bypass namedtuples which are importable. This would resolve the issues with namedtuples coming from third-party libraries. I can open a new PR

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-04-26 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, we can backport some of the cloudpickle code to make the patch less fragile. This would be a nontrivial change in an already complex code, but I'd be happy to sketch this if there's

[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-04-26 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > Does the test even pass? The tests should pass module the tests specifically checking the behaviour being removed. I think the failing RDD test is in this group as w

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

2018-04-25 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/21157 [SPARK-22674][PYTHON] Removed the namedtuple pickling patch ## What changes were proposed in this pull request? This is a breaking change. Prior to this commit PySpark patched

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2018-01-22 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/19992 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...

2018-01-20 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19992 @squito I think it's fine to just close the PR/JIRA issue. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2018-01-01 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r159153806 --- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala --- @@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-26 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158701191 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -444,12 +444,15 @@ private[spark] object JsonProtocol { ("

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-26 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158689120 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -988,12 +991,20 @@ private[spark] object JsonProtocol { rddInfo

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-26 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158688956 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -444,12 +444,15 @@ private[spark] object JsonProtocol { ("

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-20 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158154921 --- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala --- @@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-20 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158028287 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -444,12 +444,15 @@ private[spark] object JsonProtocol { ("

[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...

2017-12-18 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19992 Can someone have a look at the tests, please? I can't see the failure (and in theory, the change should not affect SparkR

[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...

2017-12-16 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19992 Minor update: I've simulated #18162 on one of our 80G event logs and (unless there is a bug in the filtering code) the log shrank to 157M. The effect of this patch was almost negligible

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-16 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r157340550 --- Diff: core/src/main/scala/org/apache/spark/storage/StorageLevel.scala --- @@ -132,6 +132,23 @@ class StorageLevel private( override def

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-15 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r157311572 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -988,12 +997,16 @@ private[spark] object JsonProtocol { rddInfo

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-15 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r157311547 --- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala --- @@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-15 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r157257388 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -444,11 +444,20 @@ private[spark] object JsonProtocol { ("

[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-15 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/19992 [SPARK-22805][CORE] Use StorageLevel aliases in event logs The format of event logs uses redundant representation for storage levels, for instance StorageLevel.DISK_ONLY is represented

[GitHub] spark pull request #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/...

2017-12-07 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/17230 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-25 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19458 @jerryshao the error looks a bit unrelated. ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: null ``` The suite passes locally for me

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-24 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r146491977 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -100,7 +100,17 @@ private[spark] class DiskBlockManager(conf

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-23 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r146213702 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-13 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r144655974 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf

[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19458 On second thought, the documentation of `DiskBlockManager.getAllFiles` states that it returns all files in the directories managed by `DiskBlockManager`. Filtering anything would violate

[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-12 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19458 > Instead of filtering out temp blocks, why not adding parsing rule for `TempLocalBlockId` and `TempShuffleBlockId`? Because this fixes the issue out 2 out of 3 possible temp fi

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-09 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r143586763 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -100,7 +102,9 @@ private[spark] class DiskBlockManager(conf

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-09 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/19458 [SPARK-7][CORE] DiskBlockManager.getAllBlocks now tolerates temp files ## What changes were proposed in this pull request? Prior to this commit getAllBlocks implicitly assumed

[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-10-05 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19369 I've fixed the failing `DiskStoreSuite` and ensured the other two suites also pass fine. --- - To unsubscribe, e-mail

[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-10-03 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19369#discussion_r142392349 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala --- @@ -85,11 +65,9 @@ object BlockReplicationUtils

[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-09-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19369#discussion_r141672528 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -67,7 +67,7 @@ private[spark] class DiskStore( var threwException

[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-09-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19369#discussion_r141671388 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala --- @@ -85,11 +65,9 @@ object BlockReplicationUtils

[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-09-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/19369#discussion_r141670392 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -49,7 +49,7 @@ private[spark] class DiskStore( private val

[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-09-28 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19369 I've looked at the allocation profile of the sample, it does indeed contain strings from other sources, so the heap histogram is not representative. However, the allocations from

[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-09-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19369 Will have a look at the failing tests tomorrow and also have a look at the origins of the strings in the sample above

[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...

2017-09-27 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19369 > Yes, if you have evidence this is a hotspot, then this does look like a valid fix. I don't think it's a hotspot (otherwise it would have probably been reported long before). I

[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...

2017-09-27 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/19369 [SPARK-22147][CORE] Removed redundant allocations from BlockId ## What changes were proposed in this pull request? Prior to this commit BlockId.hashCode and BlockId.equals were defined

[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1

2017-05-17 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17746 Thank you. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1

2017-05-16 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17746 @srowen thanks! I've missed the point that 0.13.1 was intentionally merged into the upcoming release. @dbtsai could you give an example of the breaking API change between 0.12

[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1

2017-05-16 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17746 Hello, are there any plans to backport this into 2.1 branch? The LBFGS and other fixex in 0.13.1 seem important enough. --- If your project is set up for it, you can reply to this email

[GitHub] spark issue #17837: Sync with upstream 2.1

2017-05-02 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17837 Sorry, opened by accident. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #17837: Sync with upstream 2.1

2017-05-02 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/17837 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request #17837: Sync with upstream 2.1

2017-05-02 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/17837 Sync with upstream 2.1 ## What changes were proposed in this pull request? This is a backport of the upstream `branch-2.1`. ## How was this patch tested? Test upstream

[GitHub] spark issue #16705: [SPARK-19354] [Core] Killed tasks are getting marked as ...

2017-04-24 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/16705 Is there a chance to see this is 2.1.1? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark pull request #17598: [SPARK-20284][CORE] Make {Des,S}erializationStrea...

2017-04-11 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/17598#discussion_r110950456 --- Diff: core/src/main/scala/org/apache/spark/serializer/Serializer.scala --- @@ -125,7 +125,7 @@ abstract class SerializerInstance { * A stream

[GitHub] spark pull request #17598: [SPARK-20284][CORE] Make {Des,S}erializationStrea...

2017-04-10 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/17598 [SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable ## What changes were proposed in this pull request? This PR allows to use `SerializationStream

[GitHub] spark pull request #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/...

2017-03-28 Thread superbobry
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/17230#discussion_r108351585 --- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala --- @@ -198,17 +183,114 @@ private[spark] class PipedRDD[T: ClassTag]( val

[GitHub] spark issue #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/O forma...

2017-03-16 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17230 Okay, one more benchmark, 4 partitions of [Consumer Complaint Database](https://catalog.data.gov/dataset/consumer-complaint-database), more reallistic workload than the sonnets

[GitHub] spark issue #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/O forma...

2017-03-16 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17230 Here're the result of piping [The Sonnets by W. Sheakespear](https://www.gutenberg.org/cache/epub/1105/pg1105.txt) via `cat` using either `PipedRDD` or hand-patched `PipedRDD` with I/O formats

[GitHub] spark issue #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/O forma...

2017-03-09 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/17230 @holdenk thank you again for the feedback. I want to clarify the on performance regressions: > From looking at the description change it would be important to verify that PySp

[GitHub] spark pull request #16805: [SPARK-19353][CORE] Generalize PipedRDD to use I/...

2017-03-09 Thread superbobry
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/16805 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #16805: [SPARK-19353][CORE] Generalize PipedRDD to use I/O forma...

2017-03-09 Thread superbobry
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/16805 Hi @holdenk, apologies, there must've been a rebase quirk. The only change relevant to this PR is 6d8e7c3e70da04cd8ec3a5bb63b671130df46088. I've opened a new clean PR #17230, because this one

[GitHub] spark pull request #17230: [SPARK-19353][CORE] Generalize PipedRDD to use I/...

2017-03-09 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/17230 [SPARK-19353][CORE] Generalize PipedRDD to use I/O formats ## What changes were proposed in this pull request? This patch allows to use arbitrary input and output formats when

[GitHub] spark pull request #16805: [SPARK-19353][CORE] Generalize PipedRDD to use I/...

2017-02-04 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/16805 [SPARK-19353][CORE] Generalize PipedRDD to use I/O formats ## What changes were proposed in this pull request? This commit allows to use arbitrary input and output formats when

[GitHub] spark pull request #16416: [SPARK-19010][CORE] Include Kryo exception in cas...

2016-12-27 Thread superbobry
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/16416 [SPARK-19010][CORE] Include Kryo exception in case of overflow ## What changes were proposed in this pull request? This is to workaround an implicit result of #4947 which suppressed