Hi,
I know a task can fail 2 times and only the 3rd failure breaks the entire job. I am fine with that number of attempts. What I would like is that, after a task has been tried 3 times, the job continues with the other tasks. The job can end up marked as "failed", but I want all tasks to run.

Please see my use case: I read a Hadoop input set, and some of the gzip files are incomplete. I would like to just skip them, and the only way I can see to do that is to tell Spark to tolerate some permanently failing tasks, if that is possible. With traditional Hadoop map-reduce this was possible using mapred.max.map.failures.percent.

Do map-reduce params like mapred.max.map.failures.percent apply to Spark/YARN map-reduce jobs? I edited $HADOOP_CONF_DIR/mapred-site.xml and added mapred.max.map.failures.percent=30, but it does not seem to apply; the job still failed after 3 failed task attempts. Should Spark forward this parameter, or do the mapred.* parameters simply not apply?

Are other Hadoop parameters taken into account and forwarded, e.g. the ones involved in input reading, as opposed to "processing"/"application" settings like max.map.failures? I saw that Spark should scan HADOOP_CONF_DIR and forward those settings, but I guess that does not apply to every parameter, since Spark has its own distribution and DAG/stage processing logic, which just happens to have a YARN implementation.

Do you know a way to do this in Spark: to tolerate a predefined number of failed tasks while allowing the job to continue? That way I could see all the faulty input files in one job run, delete them all, and continue with the rest.

Just to mention, doing a manual gzip -t on top of hadoop fs -cat is infeasible; map-reduce is way faster at scanning the 15K files worth 70 GB (it is doing 25 MB/s per node), while the old-style hadoop fs -cat approach does much less.

Thanks,
Nicu
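
P.S. In case it helps to make the question concrete, the job is roughly the sketch below (Scala; the input path is a placeholder). The sc.hadoopConfiguration line is only there to illustrate where I imagine the old MapReduce setting would have to end up; I do not know whether Spark's task scheduler honors it at all, which is exactly what I am asking.

import org.apache.spark.{SparkConf, SparkContext}

object ScanGzips {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scan-gzips"))

    // The old MapReduce knob for tolerating a percentage of failed map tasks,
    // set programmatically here, equivalent to the mapred-site.xml entry I added.
    // Unclear whether Spark's own scheduler pays any attention to it.
    sc.hadoopConfiguration.set("mapred.max.map.failures.percent", "30")

    // Plain scan over the ~15K gzipped files (~70 GB). A truncated .gz makes the
    // task reading it throw, and after the allowed attempts the whole job fails.
    val lines = sc.textFile("hdfs:///path/to/input/*.gz") // placeholder path
    println(s"line count: ${lines.count()}")

    sc.stop()
  }
}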