[Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Nathan Case
Accidentally sent this to the dev mailing list; I meant to send it here. I have a Spark Java application that has in the past used the hadoopFile interface to specify a custom TextInputFormat to be used when reading files. This custom class would gracefully handle exceptions such as EOFException
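The thread is truncated here, but the core idea — a record reader that treats a premature EOF as "end of input" rather than a job-killing failure — can be sketched in plain Java. This is a minimal, self-contained model of the pattern (the class name `GracefulReader` and the length-prefixed record format are illustrative assumptions, not the poster's actual code; a real implementation would subclass Hadoop's `TextInputFormat` and override its `RecordReader`):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GracefulReader {

    // Reads length-prefixed UTF records until the stream ends. A truncated
    // final record raises EOFException mid-read; instead of propagating it
    // (which in a Hadoop RecordReader would fail the whole task), we stop
    // and keep the records read so far -- the same tolerance a custom
    // TextInputFormat's RecordReader would provide.
    public static List<String> readRecords(DataInputStream in) throws IOException {
        List<String> records = new ArrayList<>();
        try {
            while (true) {
                records.add(in.readUTF());
            }
        } catch (EOFException eof) {
            // Clean end of stream, or a corrupt/truncated tail: either way,
            // return what we successfully parsed.
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("record-1");
        out.writeUTF("record-2");
        byte[] data = bytes.toByteArray();

        // Simulate a truncated file by dropping the last 3 bytes,
        // cutting the second record short.
        byte[] truncated = Arrays.copyOf(data, data.length - 3);
        List<String> records =
                readRecords(new DataInputStream(new ByteArrayInputStream(truncated)));
        System.out.println(records);
    }
}
```

In the actual Spark setup described, the subclass would be wired in via `JavaSparkContext.hadoopFile(path, MyTolerantInputFormat.class, LongWritable.class, Text.class)`, so that malformed splits degrade to partial records instead of throwing.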

Partitioning Data to optimize combineByKey

2016-06-02 Thread Nathan Case
Hello, I am trying to process a dataset of approximately 2 TB using a cluster with 4.5 TB of RAM. The data is in Parquet format and is initially loaded into a DataFrame. A subset of the data is then queried and converted to an RDD for more complicated processing. The first stage of that
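The message is cut off, but the subject line points at a well-known tuning idea: `combineByKey` aggregates values within each partition first (createCombiner/mergeValue), then merges the per-partition results (mergeCombiners); if the RDD is pre-partitioned by key, the merge step crosses no partition boundaries and the shuffle shrinks. The following is a local, Spark-free sketch of that contract, with each inner list standing in for one RDD partition (the class `CombineByKeyDemo` and the per-key-average example are illustrative assumptions, not from the thread):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class CombineByKeyDemo {

    // Local model of Spark's combineByKey contract: createCombiner runs the
    // first time a key appears in a partition, mergeValue folds further
    // values within that partition, and mergeCombiners merges per-partition
    // results. In Spark, hash-partitioning the RDD by key beforehand
    // (e.g. partitionBy(new HashPartitioner(n))) makes the mergeCombiners
    // phase shuffle-free, since each key lives in exactly one partition.
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<K, C> combined = new HashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> local = new HashMap<>();
            for (Map.Entry<K, V> e : partition) {
                C c = local.get(e.getKey());
                local.put(e.getKey(), c == null
                        ? createCombiner.apply(e.getValue())
                        : mergeValue.apply(c, e.getValue()));
            }
            // In Spark this merge is the (possibly shuffled) reduce phase.
            local.forEach((k, c) -> combined.merge(k, c, mergeCombiners));
        }
        return combined;
    }

    public static void main(String[] args) {
        // Classic per-key average via (sum, count) combiners.
        Map<String, long[]> sums = combineByKey(
                List.of(
                        List.of(Map.entry("a", 1), Map.entry("b", 2)),
                        List.of(Map.entry("a", 3))),
                v -> new long[] { v, 1 },
                (c, v) -> new long[] { c[0] + v, c[1] + 1 },
                (c1, c2) -> new long[] { c1[0] + c2[0], c1[1] + c2[1] });
        sums.forEach((k, c) ->
                System.out.println(k + " -> " + (double) c[0] / c[1]));
    }
}
```

On the real 2 TB dataset, the analogous Spark call would be `pairRdd.partitionBy(new HashPartitioner(n)).combineByKey(createCombiner, mergeValue, mergeCombiners)`, trading one up-front shuffle for cheaper downstream aggregations.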