[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @holdenk mmm...sweet! That may work, and it even makes the integration process more flexible. A Sentry integration wrapper would be trivial with this feature. Thanks! For future reference: https://github.com/apache/spark/commit/afae8f2bc82597593595af68d1aa2d802210ea8b --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
Github user kxepal closed the pull request at: https://github.com/apache/spark/pull/17671
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 Ok, will do. Thanks @HyukjinKwon.
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @HyukjinKwon Sorry, I'm a bit lost. What email? With a link to this PR to gather more opinions?
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 @HyukjinKwon The specific reason is to simplify debugging and make errors easier to understand by integrating PySpark with one of the most popular error tracking systems among Python developers, i.e. to improve the user experience. It's not a maintenance concern: you never know when and how your production job will crash, or whether you'll even be able to reproduce the issue to track down and fix the bug, so you'd want this integration on all the time. What you propose is to do that on the application side. How many UDFs would I have to rewrap to make it work? How many times would I have to explain this custom magic to newcomers? How many times would I have to copy-paste that solution between projects? I don't think that way scales well, and it brings no fun to PySpark development, especially when you can do it once on the PySpark side at no cost. Could trying this patch with Sentry change your mind?
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 > If this is the reason to add the support of thirdparty library, it sounds not quite compelling. I think you can even just simply monkey-patch udf or UserDefinedFunction. It wouldn't be too difficult. No, the main reason is to greatly improve the debugging experience for PySpark UDFs without a lot of code change. The PySpark worker is the perfect place to handle those errors. I don't think monkey-patching is a good way to go. It's basically hackery, which is unstable and can eventually break. And you'd have to copy-paste it from project to project to get good error reporting. Compare this with simply installing the error-reporting package on the worker side (raven for this PR) and passing at least one configuration option via SparkConf: that's enough to have all your errors caught. > I wonder if we could maybe make a mechanism for this that would be useful beyond just sentry but also things like connecting Python debuggers That would be great, but I'm not familiar with error management systems other than Sentry. We can start with the few we know now (Sentry will cover most Python users) and then figure out something else, like plugins via entry points, which are provided by setuptools / pkg_resources - in this case
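The entry-points idea mentioned above could be sketched roughly as follows. The group name and the plugin contract are hypothetical (nothing like this exists in PySpark); the comment suggests setuptools / pkg_resources, and `importlib.metadata` is the stdlib successor to that machinery:

```python
from importlib.metadata import entry_points


def load_error_plugins(group="pyspark.error_reporters"):
    """Discover error-reporter plugins registered under a hypothetical
    entry-point group. Each plugin would expose a callable the worker
    could invoke when a UDF raises."""
    try:
        eps = entry_points(group=group)        # Python 3.10+ selection API
    except TypeError:
        eps = entry_points().get(group, [])    # Python 3.8/3.9 fallback
    return {ep.name: ep.load() for ep in eps}
```

With no plugins installed this simply returns an empty mapping, so the worker's default behavior is unchanged.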
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
Github user kxepal commented on a diff in the pull request: https://github.com/apache/spark/pull/17671#discussion_r154892410 --- Diff: python/pyspark/worker.py --- @@ -160,6 +166,24 @@ def read_udfs(pickleSer, infile, eval_type): def main(infile, outfile): +if raven: --- End diff -- Ah, I get your point. Well, indeed, it's possible to move that logic into the `except Exception` branch. In the end, if an exception happens, the PySpark worker gets terminated, so the raven client is single-use and there's no need to keep it around all the time.
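The lazy construction discussed here could look roughly like this. This is a sketch, not the actual patch: the wiring is an assumption, though `raven.Client().captureException()` is the real raven-python API, and the client is built only once an exception has occurred, since the worker exits afterwards anyway:

```python
import os

try:
    import raven  # optional dependency; absent unless the sentry extra is installed
except ImportError:
    raven = None


def run_with_capture(func, *args):
    """Run a worker task; on failure, construct the Sentry client only
    now (one-time use) and report before re-raising."""
    try:
        return func(*args)
    except Exception:
        if raven is not None and os.environ.get("SENTRY_DSN"):
            raven.Client().captureException()
        raise  # preserve the worker's normal error handling
```

On the happy path no client is ever created, so steady-state overhead is zero.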
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
Github user kxepal commented on a diff in the pull request: https://github.com/apache/spark/pull/17671#discussion_r154890793 --- Diff: python/pyspark/worker.py --- @@ -39,6 +39,12 @@ pickleSer = PickleSerializer() utf8_deserializer = UTF8Deserializer() +try: --- End diff -- Sorry, I'm not familiar with pyspark packaging rules, so I'd very much appreciate any help here. My motivation for adding it as an extra was the same as for the others there: if, for instance, pandas is available, you can use the pandas-related features of pyspark, but a missing pandas shouldn't break pyspark. That's why pandas is defined in extras, not in install requirements, right? The same goes for raven.
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
Github user kxepal commented on a diff in the pull request: https://github.com/apache/spark/pull/17671#discussion_r154886723 --- Diff: python/pyspark/worker.py --- @@ -160,6 +166,24 @@ def read_udfs(pickleSer, infile, eval_type): def main(infile, outfile): +if raven: --- End diff -- This adds a tiny overhead to worker startup, but not enough to worry about. The main overhead comes when an exception is caught and sent to Sentry (HTTP request, traceback formatting, etc.), but at that moment you don't actually care about speed, since the code on the worker is already broken and won't be executed anymore.
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
Github user kxepal commented on a diff in the pull request: https://github.com/apache/spark/pull/17671#discussion_r154886087 --- Diff: python/pyspark/worker.py --- @@ -39,6 +39,12 @@ pickleSer = PickleSerializer() utf8_deserializer = UTF8Deserializer() +try: --- End diff -- That's what happens here. Otherwise I would have to bring in setuptools as a runtime dependency to find out whether pyspark was installed with the sentry extra or not, and that's not a good idea.
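The pattern being defended here is plain import-time feature detection, the same approach PySpark takes for pandas; a minimal sketch:

```python
# Feature-detect the optional extra at import time; no setuptools /
# pkg_resources metadata query is needed at runtime.
try:
    import raven
except ImportError:
    raven = None

sentry_enabled = raven is not None
```

Code elsewhere in the module can then simply check `if raven:` (or `sentry_enabled`) before touching any Sentry functionality.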
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 Rebased to resolve conflicts. @holdenk could you take a look, please? Is there anything else that needs to be done?
[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/17671 Hm...I'd read about broadcast variables, but never tried to use them. However, after a quick look and a quick try, I found this wouldn't change things much. Yes, you'd be able to pass a client instance to all the executors, but you'd still have to modify all the UDFs and other functions to catch exceptions with the Sentry client by wrapping their bodies in `try: ... except: raven_client.captureException()`. And for lambdas, we'd have to rewrite them completely. In the best case this could be reduced to a decorator that takes care of the routine, but you'd still have to remember to use it every time. Also, you can easily hit the same issue I did with the default threaded Sentry client transport: in some cases it isn't able to send an exception to the service before pyspark.worker calls `sys.exit(1)`. Such gotchas are quite hard to catch. This approach may be good from a design standpoint, but it doesn't reach the goal of simplifying the PySpark development experience. Well, at least we can make it better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
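The application-side decorator alternative described in this comment might look like the sketch below. The names are hypothetical, and `capture` stands in for a real reporter such as `raven_client.captureException`, passed in as a plain callable so the sketch stays self-contained:

```python
import functools


def capture_errors(capture):
    """Wrap a UDF so that any exception is reported via `capture`
    before being re-raised to Spark."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                capture()  # e.g. raven_client.captureException()
                raise
        return wrapper
    return decorator
```

The catch, as the comment notes, is that every single UDF and lambda must be wrapped by hand, which is exactly the scaling problem the worker-side approach avoids.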
[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...
GitHub user kxepal opened a pull request: https://github.com/apache/spark/pull/17671 [SPARK-20368][PYSPARK] Provide optional support for Sentry on PySpark workers

## What changes were proposed in this pull request?

### Rationale

PySpark allows Python functions to be used as UDFs and in common transformations like `map` or `filter` calls. Unfortunately, code may contain bugs that lead to exceptions. Some Python exceptions are quite easy to understand and fix; others require understanding the overall function context. For instance:

```
TypeError: 'NoneType' object is not subscriptable
```

OK, somewhere we're trying to access a `None` value by index or key, but why did this value become `None`? That was not the plan. To understand why and to reproduce the problem, you'd want to see how the function was called and what state all its locals were in.

Sentry is one of the systems that capture, store and classify tracebacks, making it easy to understand what went wrong, and it is quite popular among Python developers. Unfortunately, a project-wide Sentry configuration cannot be applied to those functions, since they get executed remotely, outside the project context. So either every function must carry a special capture handler, or the PySpark worker must take care of everything.

### Motivation

We already have this patch applied locally, and I'd like to propose it upstream. Currently, we have to patch PySpark for every release. Moreover, we cannot just patch a single file, since we must also ensure the patch gets into the pyspark.zip archive that is deployed to executors. Unfortunately, I found no way to plug into the PySpark worker that would avoid patching altogether.

### Known concerns

1. This adds support for one of many bug tracking systems. That's true. The reason for Sentry is that it's a very popular system among Python developers, and most of them are familiar with it.
I personally haven't heard of other ones widely used by Python developers, but if many of them turn out to be wanted for PySpark, we can develop a more pluggable solution.

### Possible alternatives

You can wrap ALL your functions that will be executed remotely on executors with some decorator which provides the same Sentry support, or throws a much more verbose traceback by extracting locals via the `inspect` module. This turned out to be very inconvenient, since you always have to wrap all your functions, and that is easy to forget.

### How to use

1. You need to have the Sentry client (called raven) available on the executors. It may be installed there via the system package manager or shipped via `sc.addPyFile` as an egg.
2. Pass the Sentry DSN via SparkConf as an executor environment variable, like:

```
spark.conf.set('spark.executorEnv.SENTRY_DSN', '__DSN__')
```

Additionally, you can configure the project release, environment, tags and the rest via Sentry's environment variables:
- SENTRY_ENVIRONMENT - Optional, the environment your application is running in, like `production`
- SENTRY_EXTRA_TAGS - Optional, tag names to be extracted from MDC, like `foo,bar,baz`
- SENTRY_RELEASE - Optional, the release version of your application, like `1.0.0`
- SENTRY_TAGS - Optional, tags like `tag1:value1,tag2:value2`
3. See the rest of the Sentry documentation if you're not familiar with it.

## How was this patch tested?

This patch was tested manually on local infrastructure.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/kxepal/spark 20368-sentry-support-on-pyspark-workers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17671.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17671 commit 8e9206f2a1c34847efe943afe51b5bdde7298914 Author: Alexander Shorin Date: 2017-04-17T13:25:39Z Provide optional support for Sentry on PySpark workers SPARK-20368
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 Hooray! Thank you all for the help here! (:
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @holdenk Sure, done.
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @holdenk Agree with you here. The message is fixed, PR rebased.
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @holdenk @srowen Added a warning message, please take a second look.
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @holdenk Thanks for the warning message text. Nice one! > I indicated above this swallows all of the Py4J errors and there are a host of things which could cause the Py4J bridge to break down. Unfortunately, as you can see in the [issue's traceback](https://issues.apache.org/jira/browse/SPARK-18523), it's py4j that raises an overly general exception for this kind of problem. I too expected to see a Py4JNetworkError there, since it's a network communication issue, but that didn't happen. The really informative exceptions get swallowed somewhere in the middle and just printed to stderr via logging, and I'm not sure how to re-raise them or how much that would break. > It seems like the correct action for the user to take when the Py4J bridge breaks is starting over from scratch, either by exiting and re-running their notebook or otherwise re-submitting there job. Yes, that's what happens now: in case of failure we have to shut down the notebook, start it again and re-run all the cells. If we're not running in a notebook, the whole script crashes. This raises two issues: 1. Usability. If you make a mistake or a Spark job fails, you wouldn't restart the whole notebook; you'd run a cell with `sc.stop()` and other cleanup, then re-run your Spark cells. That's a simple procedure. But when stopping the Spark context fails, you have to follow plan B: restart everything and re-run all the cells. That can be quite tedious, and in practice it is. 2. Correctness. SparkContext is a global shared mutable object, and if we cannot correctly reset its state to start over, something feels really wrong. Should we otherwise run all the code that uses SparkContext in subprocesses just to be able to implement retry logic?
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @srowen > Why do you say it's so completely unactionable that nobody should know about it? That's just from my experience: in all the cases when the driver dies, it dies in the middle of something, while you're doing some computation. In that case you already get the Py4J exception on the next operation that triggers communication with the JVM process. You already know about the problem and you're going to fix it anyway, so an additional reminder on `sc.stop()` looked redundant to me. > what's the downside to giving this information versus making it impossible to tell that the JVM failed to shut down? Perfect question! Well, I don't have any strong argument here. It seems like a matter of taste regarding how useful the logged information is. Ok, I'll add a warning. Does the following look good to you?

```python
warnings.warn('Unable to stop the JVM process. It probably crashed or was killed externally.',
              RuntimeWarning)
```
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @srowen I thought I had described that pretty well above. TL;DR: this information is not useful and you cannot do anything with it but ignore it, imho. But if you insist, I'll add a warning, that's not a problem. I just want to make sure this is really a reasonable thing to do.
[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15961 @holdenk I thought about using a warning there, but found it might be a useless one. When I stop a Spark context, I actually want to achieve either: 1. Cleaning up things before exit; 2. Starting a new Spark context with a different configuration, or just restarting a broken one. In both cases I wouldn't care much about the underlying JVM process state: I'm shutting things down, it's over, no matter how healthy or broken they are. Warnings are good for drawing your attention to some problem and hinting at actions to solve it. For example, Spark warns you if you pass an unknown key to SparkConf: it's not fatal, but it's the kind of warning you can act on. In our case I can do nothing with this warning. I could only say "oh, ok", but there's really no action that could solve the problem. The different case is when the JVM process dies in the middle of something, when you don't expect it. There you'll still get your Py4JError exception with the same not-very-useful "Connection refused", but in that case you do have to take some action to solve the issue (restart the SparkContext, increase driver memory, optimize your code flow, etc.). In the case of a Py4JError on `sc.stop()`, most likely you won't have to do anything special, and shouldn't. Let me know what you think.
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Please, do! Thanks a lot for helping here (:
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Maybe we can reach someone else with the commit bit? Do you know anyone to ping?
[GitHub] spark pull request #15961: [SPARK-18523][PySpark]Make SparkContext.stop more...
GitHub user kxepal opened a pull request: https://github.com/apache/spark/pull/15961 [SPARK-18523][PySpark] Make SparkContext.stop more reliable

## What changes were proposed in this pull request?

This PR fixes the broken state SparkContext may fall into if the Spark driver crashes or gets killed by OOM.

## How was this patch tested?

1. Start a SparkContext;
2. Find the Spark driver process and `kill -9` it;
3. Call `sc.stop()`;
4. Create a new SparkContext after that.

Without this patch, step 3 crashes and step 4 is impossible without manually resetting private attributes or restarting the IPython notebook / shell.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/kxepal/spark 18523-make-spark-context-stop-more-reliable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15961.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15961
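The behavior the PR aims for (steps 3-4 above) can be sketched as follows. This is a rough illustration, not the actual patch: `Py4JError` here is a local stand-in for `py4j.protocol.Py4JError`, and `safe_stop` models swallowing the bridge error so a fresh SparkContext can be created afterwards:

```python
class Py4JError(Exception):
    """Stand-in for py4j.protocol.Py4JError, so the sketch runs standalone."""


def safe_stop(stop_jvm):
    """Attempt to stop the driver JVM; if the bridge is already dead,
    swallow the error instead of leaving the context in a broken state.
    Returns True if the JVM acknowledged the stop, False otherwise."""
    try:
        stop_jvm()
        return True
    except Py4JError:
        # The driver process is gone (killed or crashed); there is
        # nothing left to stop, so just allow state to be reset.
        return False
```

After `safe_stop` the caller can clear its private attributes and construct a new context, which is the "step 4" that crashed before the patch.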
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon It works great! Thank you! My mistake was applying the changes to the same `wrapperFor` method, while for the 2.0.0 sources they have to be placed in the `wrap` method instead, with a small modification to pass the third argument in the recursive call.
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Oh, great news! It seems it was me who backported this patch to 2.0.0 incorrectly. I'm sorry for the false alarm then; unfortunately, I wasn't able to test it with master. I'll give it one more try today, but so far it looks like you solved the problem \o/ Thank you!
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Thanks a lot! Staying tuned.
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Ok, try something like this: ``` scala> val sv = org.apache.spark.mllib.linalg.Vectors.sparse(7, Array(0, 42), Array(-127, 128)) sv: org.apache.spark.mllib.linalg.Vector = (7,[0,42],[-127.0,128.0]) scala> val df = Seq(("thing", sv)).toDF("thing", "vector") df: org.apache.spark.sql.DataFrame = [thing: string, vector: vector] scala> df.write.format("orc").save("/tmp/thing.orc") ```
[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...
Github user kxepal commented on the issue: https://github.com/apache/spark/pull/15361 @HyukjinKwon Thanks for the patch, but unfortunately it doesn't solve the issue. Tested with Spark 2.0.0: ``` Caused by: java.lang.ClassCastException: org.apache.spark.mllib.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:558) at org.apache.spark.sql.hive.orc.OrcSerializer.wrap(OrcFileFormat.scala:164) at org.apache.spark.sql.hive.orc.OrcSerializer.wrapOrcStruct(OrcFileFormat.scala:202) at org.apache.spark.sql.hive.orc.OrcSerializer.serialize(OrcFileFormat.scala:168) at org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcFileFormat.scala:253) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) ``` Let me try to make a simple Scala test case that reproduces the issue from the shell. Maybe that will be more helpful.