I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() on a textFile() RDD" is totally fine. Perhaps if the line number and file were available in PySpark I could catch the exception and output it with the context?
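For what it's worth, here's the kind of thing I was imagining: make the function passed to .map() exception-safe, so instead of the job failing it returns the raw offending line alongside the error. This is only a sketch with a made-up parse_record() parser; the same safe_parse would be passed to sc.textFile("data.txt").map(safe_parse), but it's shown here on a plain Python list so it runs without a SparkContext:

```python
def parse_record(line):
    # Hypothetical parser: expects lines of the form "name,count".
    name, count = line.split(",")
    return (name, int(count))

def safe_parse(line):
    """Return ('ok', record) on success, or ('bad', (raw_line, error)) on failure,
    instead of letting the exception kill the whole job."""
    try:
        return ("ok", parse_record(line))
    except Exception as e:
        # Keep the raw line so the bad input can be inspected afterwards.
        return ("bad", (line, repr(e)))

# In PySpark this would be:
#   parsed = sc.textFile("data.txt").map(safe_parse)
#   bad = parsed.filter(lambda r: r[0] == "bad").collect()
# Demonstrated on a plain list here:
lines = ["a,1", "b,not_a_number", "c,3"]
results = [safe_parse(l) for l in lines]
bad = [rec for tag, rec in results if tag == "bad"]
for raw_line, err in bad:
    print("bad input:", raw_line, "->", err)
```

This doesn't give the source filename or line number directly, but at least it surfaces the exact bad record rather than just a stack trace.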
Any way to narrow down the problem input would be great. Thanks!
