[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @holdenk mmm...sweet! That may work, and it even makes the integration process more flexible. A Sentry integration wrapper would be trivial with this feature. Thanks! For future reference: https://github.com/apache/spark/commit/afae8f2bc82597593595af68d1aa2d802210ea8b
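For readers landing here later: the referenced commit added a way to select the Python daemon/worker modules via configuration. A hypothetical Sentry hook could then live in a custom module shipped with the job and be enabled purely by configuration, along these lines (this is a sketch, not code from this thread: the module name `my_sentry_daemon` and the DSN are made up, and the exact config keys should be checked against the linked commit):

```shell
# Sketch: select a custom daemon module that initialises Sentry before
# delegating to the stock pyspark.daemon, and pass the DSN to executors.
spark-submit \
  --conf spark.python.daemon.module=my_sentry_daemon \
  --conf spark.executorEnv.SENTRY_DSN=https://key@sentry.example.com/1 \
  app.py
```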
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17671 So now that we have the configurable worker module, I think this no longer needs to be integrated in this way. I'm not sure Sentry support belongs directly in Spark, but at the very least this could be simplified.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17671 Can one of the admins verify this patch?
Github user tedmiston commented on the issue: https://github.com/apache/spark/pull/17671 I checked the mailing list archives but could not find a recent thread on Sentry. Anyone know if there is one I'm missing?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17671 Can one of the admins verify this patch?
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 OK, will do. Thanks, @HyukjinKwon.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 Ah, yes. There are good places to discuss this. For example, see "Mailing List" at https://spark.apache.org/community.html. I would also like to see what others think about this. I have done this a few times; for example, see [`[discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?`](http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-PySpark-Can-we-drop-support-old-Pandas-lt-0-19-2-or-what-version-should-we-support-td22834.html) and [`[discuss][SQL] Partitioned column type inference proposal`](http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SQL-Partitioned-column-type-inference-proposal-td22838.html).
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @HyukjinKwon Sorry, I'm a bit lost. Which email do you mean? One with a link to this PR to gather more opinions?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 I am a Python developer and I use PySpark in production. @kxepal, would you mind sending an email about this discussion to the dev mailing list?
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @HyukjinKwon The specific reason is to simplify debugging and error understanding by integrating PySpark with one of the most popular error tracking systems among Python developers, i.e. to improve the user experience. It's not a maintenance thing: you never know when and how your production job will crash, or whether you could even reproduce the issue to track down and fix the bug, so you want this integration on all the time. What you propose is to do this on the application side. How many UDFs would I have to rewrap to make it work? How many times would I have to tell newcomers about this custom magic? How many times would I have to copy-paste that solution between projects? I think that approach doesn't scale well and brings no fun to PySpark development, especially when you can do it once on the PySpark side at no cost. Could trying this patch with Sentry change your mind?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 Also, there are many good Python tools: debuggers, coverage, etc. I think providing a mechanism for such things is a great idea if possible, but I doubt support for each one should be added while workarounds are relatively easy and available. I personally don't like monkey-patching in general, but it's good as a workaround in a few specific cases, and I think it's fine if it is not excessive and is well understood. PySpark also uses it in a few places.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 What are the specific reasons then? I think it's basically a maintenance issue. I don't think the diff would be much different if this were handled on the application side.
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 > If this is the reason to add the support of thirdparty library, it sounds not quite compelling. I think you can even just simply monkey-patch udf or UserDefinedFunction. It wouldn't be too difficult. No, the main reason is to greatly improve the debugging experience for PySpark UDFs without a lot of code change. The PySpark worker is the perfect place to handle these errors. I don't think monkey-patching is a good way to go. It's basically hackery, which is unstable and can eventually break. And you'd have to copy-paste it from project to project to get good error reporting. Compare this with simply installing a debugger package on the worker side (raven for this PR) and passing at least one configuration option via SparkConf - that's enough to have all your errors caught. > I wonder if we could maybe make a mechanism for this that would be useful beyond just sentry but also things like connecting Python debuggers That would be great, but I'm not familiar with error management systems other than Sentry. We could start with a few now (Sentry will cover most Python users) and then figure out something else, like plugins via entry points provided by setuptools / pkg_resources.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 Yeah, having a mechanism for things like this makes much more sense to me.
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17671 So we could do the monkey-patching for UDFs, but for reduce/aggregate/combineByKey it starts to get a little rough. I wonder if we could make a mechanism for this that would be useful beyond just Sentry - also for things like connecting Python debuggers (since we don't have the same JMX stuff)?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 For a similar reason - https://github.com/apache/spark/pull/17671#pullrequestreview-81100230 - I am less sure about adding this. Regarding the workaround you described, @kxepal - https://github.com/apache/spark/pull/17671#issuecomment-298768608: > This was found as very inconvenient way since you'll have to always wrap all your functions. Easy to forget to do. If this is the reason to add support for a third-party library, it does not sound quite compelling. I think you could even simply monkey-patch `udf` or `UserDefinedFunction`; it wouldn't be too difficult. This gives me the impression that there are many easy workarounds, and I would like to avoid the overhead of maintaining such cases and rather focus on more core stuff. It might be best to just share documentation and the monkey patches on the mailing list.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17671 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84476/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17671 **[Test build #84476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84476/testReport)** for PR 17671 at commit [`3a3301e`](https://github.com/apache/spark/commit/3a3301e820d65524b82024c04c339298eba5071c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17671 **[Test build #84476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84476/testReport)** for PR 17671 at commit [`3a3301e`](https://github.com/apache/spark/commit/3a3301e820d65524b82024c04c339298eba5071c).
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 ok to test
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17671 Thanks @holdenk. Sure, I am interested in this. I will take a look soon, by tomorrow.
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17671 Jenkins OK to test.
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 Rebased to resolve conflicts. @holdenk, could you take a look, please? Is there anything else that needs to be done?
Github user tedmiston commented on the issue: https://github.com/apache/spark/pull/17671 Does anyone know the status of this PR? I'd really like to be able to run Sentry on my PySpark code in production.
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 Hm... I'd read about broadcast variables but never tried to use them. However, after a quick look and try, I found that this won't change things much. Yes, you would be able to pass the client instance to all the executors, but you would still have to modify all the UDFs and other functions to capture exceptions with the Sentry client by wrapping each body in `try: ... except: raven_client.captureException()`. And if we have lambdas, we'd have to rewrite those completely. In the best case, this could be reduced to a decorator that takes care of these routines, but you'd still have to remember to use it all the time. Also, you can easily hit the same issue I did with the default threaded Sentry client transport: in some cases it isn't able to send an exception to the service before pyspark.worker calls `sys.exit(1)`. Such gotchas are quite hard to catch. This approach may be good from a design point of view, but it would not reach the goal of simplifying the PySpark development experience. Well, at least we can do better.
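For concreteness, the try/except wrapping described above can be reduced to a decorator along these lines. This is a sketch, not code from the PR: `FakeClient` stands in for a real `raven.Client` so it runs without Sentry installed, and `report_errors` and `divide` are made-up names.

```python
import functools
import sys

class FakeClient:
    """Stand-in for raven.Client: records exception types instead of sending them."""
    def __init__(self):
        self.captured = []

    def captureException(self):
        # raven's real captureException likewise reads sys.exc_info().
        self.captured.append(sys.exc_info()[0])

client = FakeClient()

def report_errors(fn):
    """Report any exception raised by fn to the client, then re-raise it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            client.captureException()
            raise
    return wrapper

@report_errors
def divide(a, b):
    return a / b
```

The catch, as noted in the comment above, is that every UDF in every project still has to remember to apply the decorator, which is exactly the friction the PR tries to remove.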
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17671 Thanks for working on improving the PySpark debugging experience. Perhaps a similar effect could be achieved by adding a broadcast variable which lazily loads Sentry?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17671 Can one of the admins verify this patch?