Re: pyspark exception catch
Hi,

Thanks for the answer. Regarding 2 and 3: that is indeed a solution, but as I mentioned in my question, I can just as well do input checks (using .map) before applying any other RDD operations. I still think that is overhead. Regarding 1: this will make all the other RDD operations more complex, since I will probably need to wrap everything in mapPartitions().

So back to my original question: why not simply allow exceptions to be caught directly as a result of RDD operations? Why can't Spark catch exceptions on the different workers and return them to the driver, where the developer can catch them and decide on the next step?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-exception-catch-tp20483p20788.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: pyspark exception catch
Hey Igor!

A few ways to work around this, depending on the level of exception-handling granularity you're willing to accept:

1) Use mapPartitions() to wrap the entire partition-handling code in a try/catch. This is fairly coarse-grained, however, and will fail the entire partition.

2) Modify your transformation code to wrap a try/catch around the individual record handler: return None (or some other well-known empty value) for input records that fail and the actual value for records that succeed, then use a filter() to drop the None values.

3) Same as #2, but return an empty array for a failure and a single-element array for a success (so a flatMap() drops the failures with no separate filter step).

Hope that helps!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-exception-catch-tp20483p20730.html
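To make options 2 and 3 concrete, here is a minimal sketch in plain Python. The helper names (parse, safe_parse, safe_parse_flat) are made up for illustration, and the list comprehensions stand in for what rdd.map(...).filter(...) and rdd.flatMap(...) would do per record on the workers:

```python
def parse(record):
    # Example record handler: expects exactly a (key, value) pair.
    key, value = record
    return (key, value.upper())

def safe_parse(record):
    # Option 2: catch per-record failures and return None instead.
    try:
        return parse(record)
    except (ValueError, TypeError):
        return None

def safe_parse_flat(record):
    # Option 3: empty list on failure, one-element list on success,
    # suitable for flatMap() -- no separate filter() step needed.
    try:
        return [parse(record)]
    except (ValueError, TypeError):
        return []

# Simulated locally; the middle record has 3 elements and fails to unpack.
data = [(1, 'a'), (2, 'b', 'extra'), (3, 'c')]
good = [r for r in map(safe_parse, data) if r is not None]   # option 2
flat = [out for r in data for out in safe_parse_flat(r)]     # option 3
print(good)  # [(1, 'A'), (3, 'C')]
print(flat)  # [(1, 'A'), (3, 'C')]
```

In a real job you would write pdata.map(safe_parse).filter(lambda r: r is not None) or pdata.flatMap(safe_parse_flat) and the bad records would simply disappear before foldByKey() ever sees them.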
pyspark exception catch
Hi,

Is it possible to catch exceptions using pyspark, so that in case of an error the program will not fail and exit? For example, if I am using (key, value) RDD functionality but the data doesn't actually have (key, value) format, pyspark will throw an exception (like ValueError) that I am unable to catch. I guess this is because the errors occur on each worker, where I don't have full control, and probably also because of the DAG. For example (see below), it is useless to catch the exception on .foldByKey(), since it is a transformation and not an action; the transformation will be piped and only materialized when some action is applied, like .first(). But even trying to catch the exception on the action fails. I would expect that eventually the different exceptions would be collected and returned to the driver, where the developer could handle them and decide on the next step.

** Of course I can first check the input to verify that it matches (key, value), but in my opinion this is overhead and involves extra transformations.

Code example:

data = [((1,),'e'), ((2,),'b'), ((1,),'aa', 'e'), ((2,),'bb', 'e'), ((5,),'a', 'e')]
pdata = sc.parallelize(data, 3)
t = pdata.foldByKey([], lambda v1, v2: v1+[v2] if type(v2) != list else v1+v2)
t.first()

This also fails:

try:
    t.first()
except ValueError as e:
    pass

Best,

--
Igor Mazor
Senior Business Intelligence Manager

Rocket Internet AG | Johannisstraße 20 | 10117 Berlin | Deutschland
skype: igor_rocket_internet | mail: igor.mazor@rocket-internet.de
www.rocket-internet.de

Geschäftsführer: Dr. Johannes Bruder, Arnt Jeschke, Alexander Kudlich
Eingetragen beim Amtsgericht Berlin, HRB 109262
USt-ID DE256469659
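For reference, the up-front input check dismissed above as overhead would look roughly like this. is_pair() is a made-up helper name; in a real job it would be passed to rdd.filter(...) before foldByKey(), shown here against plain Python lists:

```python
def is_pair(record):
    # Keep only records that are exactly a (key, value) 2-tuple.
    return isinstance(record, tuple) and len(record) == 2

# Same data as in the question; the 3-element tuples are the bad records.
data = [((1,), 'e'), ((2,), 'b'), ((1,), 'aa', 'e'),
        ((2,), 'bb', 'e'), ((5,), 'a', 'e')]

clean = [r for r in data if is_pair(r)]   # pdata.filter(is_pair) in Spark
print(clean)  # [((1,), 'e'), ((2,), 'b')]
```

This is the extra transformation the question refers to: one filter() pass over the data before any (key, value) operation is applied.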