Hi,

Is it possible to catch exceptions in PySpark, so that in case of an error the
program will not fail and exit?

For example, if I use the (key, value) RDD functionality but the data is not
actually in (key, value) format, PySpark throws an exception (e.g. ValueError)
that I am unable to catch.

I guess this is because the errors occur on each worker, where I don't have
full control. It is probably also due to the DAG: for example (see below), it
is useless to try to catch the exception around .foldByKey, since it is a
transformation and not an action, so it only gets pipelined and materialized
once some action, like .first(), is applied. But even wrapping the action in a
try/except fails.

I would expect the different exceptions to eventually be collected and
returned to the driver, where the developer could handle them and decide on
the next step.

** Of course I could first check the input to verify that it really is in
(key, value) format, but in my opinion that is overhead and involves extra
transformations.
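
For illustration, this is roughly the kind of pre-check I mean (a minimal
sketch, reusing the sample data from the code example further down and
assuming `sc` is the SparkContext from the pyspark shell; the `valid` / `bad`
names are just for illustration):

data = [((1,), 'e'), ((2,), 'b'), ((1,), 'aa', 'e'), ((2,), 'bb', 'e'), ((5,), 'a', 'e')]
pdata = sc.parallelize(data, 3)

# keep only records that really are (key, value) pairs
valid = pdata.filter(lambda rec: isinstance(rec, tuple) and len(rec) == 2)
bad = pdata.filter(lambda rec: not (isinstance(rec, tuple) and len(rec) == 2))

t = valid.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)
t.first()      # no exception, but the malformed records are silently dropped
bad.count()    # knowing about them costs an extra pass over the data

So I end up with two extra filters and an extra pass over the data just to
find out that something was malformed.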


code example:

# some of the records are 3-tuples, so they are not valid (key, value) pairs
data = [((1,), 'e'), ((2,), 'b'), ((1,), 'aa', 'e'), ((2,), 'bb', 'e'), ((5,), 'a', 'e')]
pdata = sc.parallelize(data, 3)
t = pdata.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)

t.first()   # this is where the ValueError from the workers surfaces

This also fails:

try:
    t.first()
except ValueError as e:
    pass
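
The closest I can get on the driver side is catching the Py4J wrapper instead
of the original ValueError, something like the sketch below. That is only a
guess on my part that py4j.protocol.Py4JJavaError is the type that actually
surfaces at the action, and even then the workers' ValueError only shows up as
text inside a long Java stack trace instead of an exception object I could
inspect:

from py4j.protocol import Py4JJavaError

try:
    t.first()
except Py4JJavaError as e:
    # the workers' ValueError is only embedded as text in the Java stack trace
    print(str(e))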


Best,


-- 
--------------------------------------------------------------------------

Igor Mazor
Senior Business Intelligence Manager


Rocket Internet AG  |  Johannisstraße 20  |  10117 Berlin  |  Germany
skype: igor_rocket_internet | mail: igor.mazor@rocket-internet.de
www.rocket-internet.de

Managing Directors: Dr. Johannes Bruder, Arnt Jeschke, Alexander Kudlich
Registered at the Amtsgericht Berlin, HRB 109262, VAT ID DE256469659
