Accidentally sent this to the dev mailing list, meant to send it here.
I have a Spark Java application that in the past has used the hadoopFile
interface to specify a custom TextInputFormat to be used when reading
files. This custom class gracefully handles exceptions such as EOF
exceptions caused by truncated files.
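For context, the core of that graceful-EOF idea can be sketched without the Hadoop APIs themselves. The snippet below is a minimal, self-contained illustration in plain java.io, not the actual TextInputFormat subclass: the class and method names are hypothetical, and a real RecordReader would do this inside its next() method instead.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: treat a truncated final record as end-of-input
// rather than letting the EOFException fail the whole read, mirroring
// what a tolerant RecordReader does for a corrupt split.
public class GracefulReader {
    // Reads fixed-width 4-byte int records from a byte array; a trailing
    // partial record is silently dropped instead of raising an error.
    public static List<Integer> readAll(byte[] data) throws IOException {
        List<Integer> records = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        while (true) {
            try {
                records.add(in.readInt());
            } catch (EOFException e) {
                // Input exhausted or truncated mid-record: stop gracefully.
                break;
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // 9 bytes: two complete int records plus one truncated byte.
        byte[] data = {0, 0, 0, 1, 0, 0, 0, 2, 9};
        List<Integer> out = readAll(data);
        System.out.println(out); // prints [1, 2]; the partial record is dropped
    }
}
```

The same catch-and-stop pattern is what a custom TextInputFormat's record reader applies per split, so one bad file tail does not kill the Spark task.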
Hello,
I am trying to process a dataset of approximately 2 TB using a cluster
with 4.5 TB of RAM. The data is in Parquet format and is initially loaded
into a DataFrame. A subset of the data is then queried for and converted
to an RDD for more complicated processing. The first stage of that