Re: binary file deserialization

2016-03-09 Thread Andy Sloane
We ended up implementing custom Hadoop InputFormats and RecordReaders by extending FileInputFormat and RecordReader, then using sc.newAPIHadoopFile to read the file as an RDD.

On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:
> We have a huge binary file in a custom
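To make the RecordReader idea concrete, here is a minimal pure-Python sketch of the core job such a reader performs: given one byte range (a "split") of a larger file, emit exactly the records that begin inside it. The file format in the thread is not described, so the fixed-length record layout, the function name read_records, and its parameters are all assumptions for illustration. This is not the Hadoop API; a real implementation would extend FileInputFormat and RecordReader as described above.

```python
import io


def read_records(stream, split_start, split_length, record_size):
    """Yield every fixed-size record that begins inside the byte range
    [split_start, split_start + split_length).

    Assumption (hypothetical format): the file is a back-to-back sequence
    of records, each exactly record_size bytes long.
    """
    # Round the split start up to the next record boundary, so a record
    # that began in the previous split is left to that split's reader.
    first = -(-split_start // record_size) * record_size
    end = split_start + split_length

    stream.seek(first)
    pos = first
    while pos < end:
        # A record that starts before `end` is read in full, even if its
        # tail extends past the end of this split.
        record = stream.read(record_size)
        if len(record) < record_size:
            break  # truncated trailing bytes: not a complete record
        yield record
        pos += record_size


if __name__ == "__main__":
    data = bytes(range(12))  # three 4-byte records
    # Two splits whose boundary (byte 5) falls in the middle of a record.
    for start, length in [(0, 5), (5, 7)]:
        print(start, length, list(read_records(io.BytesIO(data), start, length, 4)))
```

The design point is the ownership rule: a record belongs to the split in which it starts, so readers working on adjacent splits in parallel neither drop nor duplicate a record that straddles a boundary.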

Saving multiple outputs in the same job

2016-03-08 Thread Andy Sloane
We have a somewhat complex pipeline which has multiple output files on HDFS, and we'd like the materialization of those outputs to happen concurrently. Internal to Spark, any "save" call creates a new "job", which runs synchronously -- that is, the line of code after your save() executes once the
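One common way to get several blocking save calls running at the same time is to submit each one from its own driver-side thread, since Spark accepts jobs submitted concurrently from multiple threads. The sketch below shows only that threading pattern in plain Python. The helper name save_all and its parameters are hypothetical, and the message above does not say this is the approach the author settled on.

```python
from concurrent.futures import ThreadPoolExecutor


def save_all(save_actions, max_workers=4):
    """Run each blocking save action on its own thread and wait for all.

    save_actions is a list of zero-argument callables. Each one is assumed
    to wrap a single blocking "save" call.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(action) for action in save_actions]
        # result() blocks until the action finishes and re-raises any
        # exception it threw, so a failed save is not silently lost.
        return [future.result() for future in futures]
```

With PySpark, each action might be something like `lambda: errors.saveAsTextFile("/out/errors")`, where the RDD name and output path are placeholders. Caching an RDD that feeds more than one output is usually worth considering, since otherwise each job may recompute it independently.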

Re: getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
> On Wed, Mar 2, 2016 at 3:46 PM, Andy Sloane <andy.slo...@gmail.com> wrote:
>
>> We are seeing something that looks a lot like a regression from spark 1.2. When we run jobs with multiple threads, we have a crash somewhere inside getPreferredLocations, as was f

getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
We are seeing something that looks a lot like a regression from spark 1.2. When we run jobs with multiple threads, we have a crash somewhere inside getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs