I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() on a textFile() RDD" is totally fine. Perhaps if the line number and file were available in PySpark I could catch the exception and output it with the context?
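For what it's worth, here's the kind of thing I was imagining: make the function passed to .map() exception-safe, so instead of the job failing it returns the raw offending line alongside the error. This is only a sketch with a made-up parse_record() parser; the same safe_parse would be passed to sc.textFile("data.txt").map(safe_parse), but it's shown here on a plain Python list so it runs without a SparkContext:

```python
def parse_record(line):
    # Hypothetical parser: expects lines of the form "name,count".
    name, count = line.split(",")
    return (name, int(count))

def safe_parse(line):
    """Return ('ok', record) on success, or ('bad', (raw_line, error)) on failure,
    instead of letting the exception kill the whole job."""
    try:
        return ("ok", parse_record(line))
    except Exception as e:
        # Keep the raw line so the bad input can be inspected afterwards.
        return ("bad", (line, repr(e)))

# In PySpark this would be:
#   parsed = sc.textFile("data.txt").map(safe_parse)
#   bad = parsed.filter(lambda r: r[0] == "bad").collect()
# Demonstrated on a plain list here:
lines = ["a,1", "b,not_a_number", "c,3"]
results = [safe_parse(l) for l in lines]
bad = [rec for tag, rec in results if tag == "bad"]
for raw_line, err in bad:
    print("bad input:", raw_line, "->", err)
```

This doesn't give the source filename or line number directly, but at least it surfaces the exact bad record rather than just a stack trace.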
Any way to narrow down the problem input would be great. Thanks!
