Re: pyspark exception catch

2014-12-19 Thread imazor
Hi,

Thanks for the answer.

Regarding 2 and 3, that is indeed a solution, but as I mentioned in my question, I
could just as well do input checks (using .map) before applying any other rdd
operations. I still think that is overhead.

Regarding 1, this will make all the other rdd operations more complex, as I
will probably need to wrap everything in mapPartitions().
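Just to illustrate the extra wrapping I mean: even a single step would turn from a
plain rdd.map(do_transform) into roughly the following sketch (do_transform() here
is just a placeholder for the real per-record logic):

def guarded_partition(records):
    # try/except around each record so one bad record doesn't kill the whole job
    results = []
    for record in records:
        try:
            results.append(do_transform(record))
        except ValueError:
            pass  # drop records that fail
    return results

result = rdd.mapPartitions(guarded_partition)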

So back to my original question: why not simply allow exceptions resulting from
rdd operations to be caught directly? Why can't spark catch exceptions
on the different workers and return them to the driver, where the
developer can catch them and decide on the next step?

 






Re: pyspark exception catch

2014-12-16 Thread cfregly
hey igor!

a few ways to work around this, depending on the level of exception-handling
granularity you're willing to accept (rough sketches of each below):
1) use mapPartitions() to wrap the entire partition-handling code in a
try/catch -- this is fairly coarse-grained, however, and will fail the
entire partition.
2) modify your transformation code to wrap a try/catch around the individual
record handler -- return None (or some other well-known empty value) for input
records that fail and the actual value for records that succeed, then use a
filter() to filter out the None values.
3) same as #2, but use an empty array for a failure and a single-element array
for a success.
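
rough, untested sketches of all three -- parse_record() and rdd here are just
placeholders for your own per-record handler and input rdd:

# 1) coarse-grained: try/except around the whole partition
def safe_partition(records):
    try:
        return [parse_record(r) for r in records]
    except Exception:
        return []  # any failure drops the entire partition

cleaned = rdd.mapPartitions(safe_partition)

# 2) per-record try/except, mark failures as None, then filter() them out
def safe_record(record):
    try:
        return parse_record(record)
    except Exception:
        return None

cleaned = rdd.map(safe_record).filter(lambda x: x is not None)

# 3) per-record try/except returning [] / [value], flattened with flatMap()
def safe_record_as_list(record):
    try:
        return [parse_record(record)]
    except Exception:
        return []

cleaned = rdd.flatMap(safe_record_as_list)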

hope that helps!






pyspark exception catch

2014-12-05 Thread Igor Mazor
Hi,

Is it possible to catch exceptions using pyspark, so that in case of an error the
program will not fail and exit?

For example, if I am using (key, value) rdd functionality but the data doesn't
actually have (key, value) format, pyspark will throw an exception (like
ValueError) that I am unable to catch.

I guess this is because the errors occur on each worker, where I don't have
full control.
Also probably because of the DAG: for example (see below), it is useless to
catch an exception on the .foldByKey, since it is a transformation and not an
action; the transformation will only be pipelined and materialized when some
action is applied, like .first().
But even trying to catch the exception on the action fails.

I would expect that eventually the different exceptions would be collected
and returned to the driver, where the developer could handle them and
decide on the next step.

** Of course I can first check the input to verify that it matches (key,
value), but in my opinion this is overhead and involves extra
transformations.


code example:

# records deliberately mix 2-tuples and 3-tuples, so the rdd is not in valid (key, value) form
data = [((1,), 'e'), ((2,), 'b'), ((1,), 'aa', 'e'), ((2,), 'bb', 'e'), ((5,), 'a', 'e')]
pdata = sc.parallelize(data, 3)
# collect all values per key into a list
t = pdata.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)

t.first()

this also fails:

try:
    t.first()
except ValueError as e:
    pass
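
For comparison, the kind of upfront input check I mean (just a rough sketch; the
exact predicate is only illustrative):

# keep only proper (key, value) 2-tuples before applying foldByKey
valid = pdata.filter(lambda rec: isinstance(rec, tuple) and len(rec) == 2)
t = valid.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)
t.first()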


Best,


-- 
Igor Mazor
Senior Business Intelligence Manager


Rocket Internet AG  |  Johannisstraße 20  |  10117 Berlin  |  Deutschland
skype: igor_rocket_internet | mail: igor.mazor@rocket-internet.de
www.rocket-internet.de

Managing Directors: Dr. Johannes Bruder, Arnt Jeschke, Alexander Kudlich
Registered at Amtsgericht Berlin, HRB 109262, VAT ID DE256469659