[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2018-02-28 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
@holdenk 
mmm...sweet! That may work and even makes the integration process more
flexible. A Sentry integration wrapper would be trivial with this feature.
Thanks!

For future reference:
https://github.com/apache/spark/commit/afae8f2bc82597593595af68d1aa2d802210ea8b


---




[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2018-02-28 Thread kxepal
Github user kxepal closed the pull request at:

https://github.com/apache/spark/pull/17671


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-12-05 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
Ok, will do. Thanks, @HyukjinKwon.


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-12-05 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
@HyukjinKwon 
Sorry, I'm a bit lost. What email? With a link to this PR to gather more 
opinions? 


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-12-05 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
@HyukjinKwon 
The specific reason is to simplify debugging and error analysis by
integrating PySpark with one of the most popular error tracking systems among
Python developers - in other words, to improve the user experience.

It's not a maintenance concern: you never know when and how your production
will crash, or whether you'll even be able to reproduce the issue to track down
and fix the bug. You want this integration on all the time.

What you propose is to handle this on the application side. How many UDFs would
I have to rewrap to make it work? How many times would I have to explain this
custom magic to newcomers? How many times would I have to copy-paste that
solution between projects? That approach doesn't scale well and brings no fun
to PySpark development, especially when you can do it once on the PySpark side
at no cost.

Could trying this patch with Sentry change your mind?


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-12-05 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
> If this is the reason to add the support of thirdparty library, it sounds 
not quite compelling. I think you can even just simply monkey-patch udf or 
UserDefinedFunction. It wouldn't be too difficult.

No, the main reason is to greatly improve the debugging experience for PySpark
UDFs without a lot of code changes. The PySpark worker is the perfect place to
handle those errors.

I don't think monkey-patching is a good way to go. It's basically hackery,
which is unstable and can eventually break. And you'd have to copy-paste it
from project to project to get good error reporting.

Compare that with simply installing the error reporting package on the worker
side (raven for this PR) and passing at least one configuration option via
SparkConf - that's enough to have all your errors caught.
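For concreteness, here is a minimal sketch of that single option, following the
`spark.executorEnv.SENTRY_DSN` approach described in this PR (the DSN value is
a placeholder):

```python
from pyspark.sql import SparkSession

# Build a session whose executors see SENTRY_DSN in their environment;
# a worker with raven installed can then pick it up.
spark = (
    SparkSession.builder
    .appName("sentry-enabled-job")
    .config("spark.executorEnv.SENTRY_DSN", "https://<key>@sentry.example.com/<project>")
    .getOrCreate()
)
```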

> I wonder if we could maybe make a mechanism for this that would be useful 
beyond just sentry but also things like connecting Python debuggers

That would be great, but I'm not familiar with error management systems other
than Sentry. We can start with a few now (Sentry will cover most Python users)
and figure out something else later, like plugins via the entry points provided
by setuptools / pkg_resources - in this case 


---




[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2017-12-05 Thread kxepal
Github user kxepal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17671#discussion_r154892410
  
--- Diff: python/pyspark/worker.py ---
@@ -160,6 +166,24 @@ def read_udfs(pickleSer, infile, eval_type):
 
 
 def main(infile, outfile):
+if raven:
--- End diff --

Ah, I get your point. Well, indeed, it's possible to move that setup into the
`except Exception` branch. In the end, if an exception happens, the PySpark
worker gets terminated, so the raven client is for one-time use and there is no
need to keep it around all the time.
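A minimal sketch of that variant, assuming the optional `raven` import guard
from the other diff comment; `run_worker_loop` is a hypothetical stand-in for
the worker's main processing loop, not the actual PR diff:

```python
import os

try:
    import raven  # optional dependency; see the import guard in the other diff
except ImportError:
    raven = None


def run_worker_loop():
    """Hypothetical stand-in for the worker's main processing loop."""


try:
    run_worker_loop()
except Exception:
    # Build the client lazily: the worker terminates once an exception escapes,
    # so a one-shot client is enough and nothing stays resident.
    if raven is not None and os.environ.get("SENTRY_DSN"):
        raven.Client().captureException()
    raise
```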


---




[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2017-12-05 Thread kxepal
Github user kxepal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17671#discussion_r154890793
  
--- Diff: python/pyspark/worker.py ---
@@ -39,6 +39,12 @@
 pickleSer = PickleSerializer()
 utf8_deserializer = UTF8Deserializer()
 
+try:
--- End diff --

Sorry, I'm not familiar with PySpark's packaging rules, so I would very much
appreciate any help here.

My motivation for adding it as an extra was the same as for the other extras
already there: if, for instance, pandas is available, you can use the
pandas-related features of PySpark, but a missing pandas shouldn't break
PySpark. That's why pandas is defined in extras rather than in the install
requirements, right? The same goes for raven.
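Purely as an illustration of that layout, a trimmed sketch of how such an extra
could sit next to an existing one in python/setup.py (the names, pins and
surrounding arguments are placeholders, not the actual file contents):

```python
from setuptools import setup

setup(
    name='pyspark',
    # ... other arguments elided ...
    extras_require={
        'sql': ['pandas'],    # existing optional feature: pandas-backed APIs
        'sentry': ['raven'],  # proposed optional feature: Sentry error reporting
    },
)
```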


---




[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2017-12-05 Thread kxepal
Github user kxepal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17671#discussion_r154886723
  
--- Diff: python/pyspark/worker.py ---
@@ -160,6 +166,24 @@ def read_udfs(pickleSer, infile, eval_type):
 
 
 def main(infile, outfile):
+if raven:
--- End diff --

This adds a tiny overhead to worker startup, but nothing to worry about. The
main overhead comes from catching an exception and sending it to Sentry (the
HTTP request, traceback formatting, etc.), but at that point you don't really
care about speed, since the code on the worker is already broken and won't be
executed anymore.


---




[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2017-12-05 Thread kxepal
Github user kxepal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17671#discussion_r154886087
  
--- Diff: python/pyspark/worker.py ---
@@ -39,6 +39,12 @@
 pickleSer = PickleSerializer()
 utf8_deserializer = UTF8Deserializer()
 
+try:
--- End diff --

That's what happens here. Otherwise I would have to bring in setuptools as a
runtime dependency just to find out whether PySpark was installed with the
sentry extra or not - and that's not a good idea.
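In other words, the availability of the package itself acts as the feature
flag - roughly this pattern:

```python
try:
    import raven  # present only when the optional Sentry extra was installed
except ImportError:
    raven = None  # feature is silently disabled when the package is missing
```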


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-12-05 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
Rebased to resolve conflicts.

@holdenk could you take a look, please? Is there anything else that needs to be done?


---




[GitHub] spark issue #17671: [SPARK-20368][PYSPARK] Provide optional support for Sent...

2017-05-02 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/17671
  
Hm...I'd read about broadcast variables, but never tried to use them. However,
after a quick look and try, I found that they don't change things too much.

Yes, you'd be able to pass a client instance to all the executors, but you'd
still have to modify all the UDFs and the rest of the functions to capture
exceptions with the Sentry client by wrapping each body in `try: ... except:
raven_client.captureException()`. And lambdas would have to be rewritten
completely.

In the best case this could be reduced to a decorator that takes care of all
these routines, but you'd still have to remember to use it all the time. You
can also easily hit the same issue I did with the default threaded Sentry
client transport: in some cases it isn't able to send the exception to the
service before pyspark.worker calls `sys.exit(1)`. Such gotchas are quite hard
to catch.

This approach may be good from a design point of view, but it doesn't reach the
goal of simplifying the PySpark development experience. Well, at least we can
do better than that.
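For illustration only, a rough sketch of that wrap-every-function approach (the
function and column names are made up; the client is built lazily on the
executor, which sidesteps pickling but keeps the threaded-transport gotcha):

```python
import functools
import os

from raven import Client


def report_to_sentry(fn):
    """Wrap a function so escaped exceptions are reported to Sentry, then re-raised."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # The default threaded transport may still lose the event if the
            # worker exits first - the gotcha mentioned above.
            Client(os.environ.get("SENTRY_DSN", "")).captureException()
            raise
    return wrapper


@report_to_sentry
def parse_value(row):
    return row["value"][0]  # may blow up on a None value
```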


---



[GitHub] spark pull request #17671: [SPARK-20368][PYSPARK] Provide optional support f...

2017-04-18 Thread kxepal
GitHub user kxepal opened a pull request:

https://github.com/apache/spark/pull/17671

[SPARK-20368][PYSPARK] Provide optional support for Sentry on PySpark 
workers

## What changes were proposed in this pull request?

### Rationale

PySpark allows Python functions to be used as UDFs and in common
transformations like `map` or `filter` calls. Unfortunately, code may
contain bugs which lead to exceptions. Some Python exceptions are
quite easy to understand and fix; others require understanding the
overall function context. For instance:

```
TypeError: 'NoneType' object is not subscriptable
```

So we are evidently trying to access a `None` value by index or key, but
why did this value become `None`? That was not in our plans. To understand
why and to reproduce the problem, you'd like to see how this function was
called and what state all its locals were in.

Sentry is one of the systems that capture, store and classify
tracebacks, making it easy to understand what went wrong, and it is
quite popular among Python developers.

Unfortunately, a project-wide Sentry configuration cannot be applied to
those functions, since they are executed remotely, outside the project
context. So either every function must have a special capture handler,
or the PySpark worker has to take care of everything.

### Motivation

While we have this patch applied locally, I'd like to propose it for
upstream. Currently, we have to patch PySpark for every release.
Unfortunately, we cannot just patch a single file, since we also have to
ensure that the patch gets into the pyspark.zip archive that is deployed
to the executors. And we found no way to hook a plugin into the PySpark
worker that would avoid patching altogether.

### Known concerns

1. This adds support for one of many bug tracking systems. That's true.
   The reason "why Sentry" is that it's a very popular system among Python
   developers and most of them are familiar with it. I personally haven't
   heard of other ones widely used by Python developers, but if many of
   them turn out to want PySpark support, we can develop a more pluggable
   solution.

### Possible alternatives

You can wrap ALL of your functions that will be executed remotely on
executors with some decorator which provides the same Sentry support, or
which raises a much more verbose traceback, extracting locals via the
`inspect` module. This turned out to be a very inconvenient approach,
since you have to wrap every function yourself, every time, and it is
easy to forget. A sketch of such a decorator is shown below.
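Purely illustrative, not part of this PR: a decorator of the second flavor,
re-raising with the failing frame's locals extracted via the `inspect` module.

```python
import functools
import inspect


def with_locals(fn):
    """Re-raise any error with the innermost frame's locals in the message."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            frame = inspect.trace()[-1][0]  # the frame where the error was raised
            detail = ", ".join("%s=%r" % item for item in frame.f_locals.items())
            raise RuntimeError("%s (locals: %s)" % (exc, detail))
    return wrapper
```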

### How to use

1. You need to have the Sentry client (called raven) available on the
   executors. It may be installed there via the system package manager or
   passed via `sc.addPyFile` as an egg.

2. Pass the Sentry DSN via SparkConf as an executor environment variable,
   like:
   ```
   spark.conf.set('spark.executorEnv.SENTRY_DSN', '__DSN__')
   ```
   Additionally, you can configure the project release, environment, tags
   and other bits via Sentry's environment variables:
   - SENTRY_ENVIRONMENT - Optional, the environment your application is
     running in, like `production`
   - SENTRY_EXTRA_TAGS - Optional, tag names to be extracted from MDC,
     like `foo,bar,baz`
   - SENTRY_RELEASE - Optional, the release version of your application,
     like `1.0.0`
   - SENTRY_TAGS - Optional, tags like `tag1:value1,tag2:value2`

3. Follow the rest of the Sentry documentation if you're not familiar with
   how to use Sentry. A combined configuration sketch is shown below.
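A combined sketch of the three steps above, assuming the optional Sentry
environment variables reach the executors the same way as the DSN, via
`spark.executorEnv.*` (all values and paths are placeholders):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("job-with-sentry")
    .set("spark.executorEnv.SENTRY_DSN", "https://<key>@sentry.example.com/<project>")
    .set("spark.executorEnv.SENTRY_ENVIRONMENT", "production")
    .set("spark.executorEnv.SENTRY_RELEASE", "1.0.0")
    .set("spark.executorEnv.SENTRY_TAGS", "team:data,pipeline:daily")
)
sc = SparkContext(conf=conf)
# If raven is not installed on the executors, ship it yourself:
# sc.addPyFile("/path/to/raven.egg")
```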

## How was this patch tested?

This patch was tested manually on local infrastructure.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kxepal/spark 
20368-sentry-support-on-pyspark-workers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17671


commit 8e9206f2a1c34847efe943afe51b5bdde7298914
Author: Alexander Shorin 
Date:   2017-04-17T13:25:39Z

Provide optional support for Sentry on PySpark workers

SPARK-20368




---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-28 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
Hooray! 🎉 Thank you all for help here! (:


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-28 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@holdenk 
Sure, done.


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-28 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@holdenk 
Agreed. The message is fixed and the PR has been rebased.


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-24 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@holdenk @srowen 
Added a warning message, please take a second look.


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-24 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@holdenk 
Thanks for the warning message text. Nice one!

> I indicated above this swallows all of the Py4J errors and there are a 
host of things which could cause the Py4J bridge to break down. 

Unfortunately, as you can see in the [issue's traceback](https://issues.apache.org/jira/browse/SPARK-18523),
it's Py4J that raises a too-general exception for this kind of problem. I too
expected to see a Py4JNetworkError there, since it's a network communication
issue, but that didn't happen. The really useful exceptions get swallowed
somewhere in the middle and are just printed to stderr via logging, and I'm not
sure how to re-raise them or how much that would break.

> It seems like the correct action for the user to take when the Py4J 
bridge breaks is starting over from scratch, either by exiting and re-running 
their notebook or otherwise re-submitting there job.

Yes, that's what happens now: in case of failure we have to shut down the
notebook, start it again and re-run all the cells. If we're not running in a
notebook, the whole script crashes. This raises two issues:

1. Usability. If you make some mistake or a Spark job eventually fails, you
wouldn't restart the whole notebook; you'd run a cell with `sc.stop` and the
other cleanup, then re-run your Spark cells. That's a simple procedure. But
when stopping the Spark context fails, you have to follow plan B: restart
everything and re-run all the cells. That can be quite tedious, and in practice
it is.

2. Correctness. SparkContext is a global shared mutable object, and if we
cannot correctly reset its state to the default to start over, something feels
really wrong here. Should we otherwise run all the code that uses SparkContext
in subprocesses just to be able to implement retry logic?


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-23 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@srowen 

> Why do you say it's so completely unactionable that nobody should know 
about it? 

That's just from my experience: in all the cases where the driver dies, it dies
in the middle of something, while you're doing some computation. In that case
you already get the Py4J exception on the next operation that triggers
communication with the JVM process. You already know about the problem and
you're going to fix it anyway, so an additional reminder on `sc.stop()` looked
redundant to me.

> what's the downside to giving this information versus making it 
impossible to tell that the JVM failed to shut down?

Perfect question! Well, I don't have any strong argument here. It seems like a
matter of taste as to how useful the logged information is.

Ok, I'll add a warning. Does the following look good to you?
```python
warnings.warn('Unable to stop the JVM process; it probably crashed or was '
              'killed externally.', RuntimeWarning)
```


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-23 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@srowen 
I thought I had described that pretty well above. TL;DR: this information is
not useful and you cannot do anything with it but ignore it, imho.

But if you insist, I'll add a warning, that's not a problem. I just want to
make sure this is really a reasonable thing to do.


---



[GitHub] spark issue #15961: [SPARK-18523][PySpark]Make SparkContext.stop more reliab...

2016-11-22 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15961
  
@holdenk 
I thought about using a warning there, but found it might be a useless one.
When I stop the Spark context, I actually want to achieve one of two things:
1. Clean things up before exit;
2. Start a new Spark context with a different configuration, or just restart a
broken one.

In both cases I wouldn't care much about the underlying JVM process state - I'm
shutting things down, it's over, no matter how healthy or broken they are.

Warnings are good for drawing your attention to some problem and hinting at
actions to solve it. For example, Spark warns you if you pass an unknown key to
SparkConf - it's not fatal, but it's the kind of warning you can act on.

In our case I can do nothing with this warning. I could only say "oh, ok", but
there is really no action that could be taken to solve the problem.

The different case is when the JVM process dies in the middle of something,
when you don't expect it. There you'll still get your Py4JError exception with
the same not-very-useful "Connection refused", but in that case you do have to
take some action to solve the issue (restart the SparkContext, increase driver
memory, optimize your code flow, etc.). With a Py4JError on `sc.stop()`, most
likely you won't have to do anything special, and shouldn't.

Let me know what you think.


---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-11-21 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon Please, do! Thanks a lot for helping here (:


---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-11-21 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon Maybe we can reach someone else with the commit bit? Do you know
anyone to ping?


---



[GitHub] spark pull request #15961: [SPARK-18523][PySpark]Make SparkContext.stop more...

2016-11-21 Thread kxepal
GitHub user kxepal opened a pull request:

https://github.com/apache/spark/pull/15961

[SPARK-18523][PySpark]Make SparkContext.stop more reliable

## What changes were proposed in this pull request?

This PR fixes the broken state that SparkContext may fall into if the Spark
driver crashes or gets killed due to OOM.

## How was this patch tested?

1. Start SparkContext;
2. Find Spark driver process and `kill -9` it;
3. Call `sc.stop()`;
4. Create new SparkContext after that;

Without this patch you crash on step 3 and cannot perform step 4 without
manually resetting private attributes or restarting the IPython notebook /
shell. A rough reproduction sketch follows below.
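Rough reproduction sketch of those steps, assuming local mode and that the
backing JVM process is killed externally at the prompt (app names are
arbitrary; use `raw_input` on Python 2):

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("repro"))
input("Find the driver JVM process, `kill -9` it, then press Enter... ")
sc.stop()  # step 3: without the patch this raises a Py4J error and leaves broken state
sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("repro-2"))  # step 4
```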

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kxepal/spark 
18523-make-spark-context-stop-more-reliable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15961.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15961






---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-10-07 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon 
It works great! Thank you! My mistake was applying the changes to the same
`wrapperFor` method, while for the 2.0.0 sources they have to be placed in the
`wrap` method instead, with a small modification to pass the third argument in
the recursive call.


---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-10-06 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon 
Oh, great news! It seems it was me who backported this patch to 2.0.0
incorrectly. I'm sorry for the false alarm then - unfortunately, I wasn't able
to test it with master.

I'll give it one more try today, but so far it looks like you solved the
problem \o/ Thank you!






---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-10-06 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon Thanks a lot! Staying tuned.


---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-10-06 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon 
Ok, try something like this:
```
scala> val sv = org.apache.spark.mllib.linalg.Vectors.sparse(7, Array(0, 
42), Array(-127, 128))
sv: org.apache.spark.mllib.linalg.Vector = (7,[0,42],[-127.0,128.0])

scala> val df = Seq(("thing", sv)).toDF("thing", "vector")
df: org.apache.spark.sql.DataFrame = [thing: string, vector: vector]

scala> df.write.format("orc").save("/tmp/thing.orc")
```


---



[GitHub] spark issue #15361: [SPARK-17765][SQL] Support for writing out user-defined ...

2016-10-06 Thread kxepal
Github user kxepal commented on the issue:

https://github.com/apache/spark/pull/15361
  
@HyukjinKwon 
Thanks for the patch, but unfortunately it doesn't solve the issue. Tested with
Spark 2.0.0:
```
Caused by: java.lang.ClassCastException: org.apache.spark.mllib.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType
  at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:558)
  at org.apache.spark.sql.hive.orc.OrcSerializer.wrap(OrcFileFormat.scala:164)
  at org.apache.spark.sql.hive.orc.OrcSerializer.wrapOrcStruct(OrcFileFormat.scala:202)
  at org.apache.spark.sql.hive.orc.OrcSerializer.serialize(OrcFileFormat.scala:168)
  at org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcFileFormat.scala:253)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
```

Let me try to make a simple Scala test case that reproduces the issue from the
shell. Maybe that will be more helpful.


---