Hi,

Is it possible to catch exceptions in PySpark so that, in case of an error, the program does not fail and exit?
For example, if I am using the (key, value) RDD functionality but the data does not actually have a (key, value) format, PySpark throws an exception (e.g. ValueError) that I am unable to catch. I guess this is because the error occurs on each worker, where I don't have full control, and probably also because of the DAG. For example (see below), it is useless to catch the exception around .foldByKey(), since it is a transformation and not an action; the transformation is only pipelined and materialized when some action is applied, like .first(). But even trying to catch the exception around the action fails. I would expect the different exceptions to eventually be collected and returned to the driver, where the developer could handle them and decide on the next step.

** Of course I could first check that the input matches (key, value), but in my opinion that is overhead and involves extra transformations (see the sketch in the P.S. below).

Code example:

# the 3-tuples are deliberately malformed: they are not (key, value) pairs
data = [((1,), 'e'), ((2,), 'b'), ((1,), 'aa', 'e'), ((2,), 'bb', 'e'), ((5,), 'a', 'e')]
pdata = sc.parallelize(data, 3)
t = pdata.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)
t.first()

This also fails:

try:
    t.first()
except ValueError as e:
    pass

Best,

--
Igor Mazor
Senior Business Intelligence Manager
Rocket Internet AG | Johannisstraße 20 | 10117 Berlin | Germany
skype: igor_rocket_internet | mail: igor.mazor@rocket-internet.de
www.rocket-internet.de
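P.S. To make the workaround concrete, this is roughly the pre-validation I have in mind. It is only a minimal sketch: the is_pair predicate, the tag helper, and the 'ok'/'bad' labels are illustrative names I chose, not anything PySpark provides.

# Variant 1: filter out records that are not proper (key, value) pairs
# before foldByKey, at the cost of one extra transformation.
def is_pair(rec):
    return isinstance(rec, tuple) and len(rec) == 2

clean = pdata.filter(is_pair)
safe = clean.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2)
print(safe.first())

# Variant 2: catch the error per record on the workers and turn it into
# data, so the driver can inspect the failures instead of crashing.
def tag(rec):
    try:
        k, v = rec                      # raises ValueError for the 3-tuples
        return ('ok', (k, v))
    except ValueError as e:
        return ('bad', (rec, str(e)))

tagged = pdata.map(tag).cache()
good = tagged.filter(lambda x: x[0] == 'ok').map(lambda x: x[1])
bad = tagged.filter(lambda x: x[0] == 'bad').map(lambda x: x[1])
print(bad.collect())                    # the driver sees what failed and why

Variant 2 is essentially the "exceptions collected and returned to the driver" behaviour I was expecting, just done by hand.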