We ended up implementing a custom Hadoop InputFormat and RecordReader by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read the file as an RDD.
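
For reference, a minimal sketch of that approach. The format here is a
made-up one (fixed 1024-byte records); the class name and record size are
illustrative, not our actual format:

  import org.apache.hadoop.fs.{FSDataInputStream, Path}
  import org.apache.hadoop.io.{BytesWritable, LongWritable}
  import org.apache.hadoop.mapreduce.{InputSplit, JobContext,
    RecordReader, TaskAttemptContext}
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

  class FixedLenBinaryInputFormat
      extends FileInputFormat[LongWritable, BytesWritable] {
    val RecordLen = 1024 // illustrative fixed record size

    // Simplest way to avoid splitting a file mid-record: don't split.
    override def isSplitable(context: JobContext, file: Path): Boolean = false

    override def createRecordReader(split: InputSplit, ctx0: TaskAttemptContext)
        : RecordReader[LongWritable, BytesWritable] =
      new RecordReader[LongWritable, BytesWritable] {
        private var in: FSDataInputStream = _
        private var start, pos, end = 0L
        private val key = new LongWritable
        private val value = new BytesWritable
        private val buf = new Array[Byte](RecordLen)

        override def initialize(s: InputSplit, ctx: TaskAttemptContext): Unit = {
          val fileSplit = s.asInstanceOf[FileSplit]
          val fs = fileSplit.getPath.getFileSystem(ctx.getConfiguration)
          in = fs.open(fileSplit.getPath)
          start = fileSplit.getStart
          pos = start
          end = start + fileSplit.getLength
          in.seek(start)
        }

        override def nextKeyValue(): Boolean = {
          if (pos + RecordLen > end) return false // skip trailing partial record
          in.readFully(buf)             // read exactly one record
          key.set(pos)                  // key = byte offset of the record
          value.set(buf, 0, RecordLen)  // value = the record bytes
          pos += RecordLen
          true
        }

        override def getCurrentKey: LongWritable = key
        override def getCurrentValue: BytesWritable = value
        override def getProgress: Float =
          if (end == start) 1.0f else (pos - start).toFloat / (end - start)
        override def close(): Unit = if (in != null) in.close()
      }
  }

and then on the Spark side:

  val records = sc.newAPIHadoopFile[
    LongWritable, BytesWritable, FixedLenBinaryInputFormat]("hdfs:///data/huge.bin")

Since isSplitable is false you get one partition per file; making the format
splittable means aligning each split to a record boundary in initialize().
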
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:
> We have a huge binary file in a custom
We have a somewhat complex pipeline which has multiple output files on
HDFS, and we'd like the materialization of those outputs to happen
concurrently.
Internally, any "save" call in Spark creates a new "job", which runs
synchronously -- that is, the line of code after your save() executes once
the job completes.
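
A minimal sketch of one way around that, assuming the outputs are
independent (rddA, rddB, and the paths below are placeholders): submit each
save() from its own thread, since SparkContext is thread-safe for concurrent
job submission.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  // Placeholder RDDs and paths; each save() blocks only its own thread,
  // so the jobs can run concurrently on the cluster.
  val jobs = Seq(rddA -> "hdfs:///out/a", rddB -> "hdfs:///out/b").map {
    case (rdd, path) => Future { rdd.saveAsTextFile(path) }
  }
  Await.result(Future.sequence(jobs), Duration.Inf)
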
> On Wed, Mar 2, 2016 at 3:46 PM, Andy Sloane <andy.slo...@gmail.com> wrote:
We are seeing something that looks a lot like a regression from Spark 1.2.
When we run jobs with multiple threads, we have a crash somewhere inside
getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside
org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs.
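
For illustration, a minimal sketch of the pattern that triggers it; the RDD,
data, and thread count here are made up, not our actual pipeline:

  // Several driver threads run actions that share one shuffle dependency,
  // so MapOutputTrackerMaster is queried concurrently for the map output
  // locations of that shuffle.
  val shared = sc.parallelize(1 to 1000000)
    .map(i => (i % 100, 1L))
    .reduceByKey(_ + _)

  val threads = (1 to 4).map { _ =>
    new Thread(new Runnable {
      override def run(): Unit = { shared.collect() }
    })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
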