I'm fairly new to Spark.
We have data in Avro files on HDFS.
We are trying to load all of the Avro files (about 28 GB worth right now) and
run an aggregation.

When there are fewer than 200 tasks, everything runs and produces the correct
results. When there are more than 200 tasks (as reported in the logs by the
TaskSetManager), the grouping goes wrong: the RDD read from HDFS seems to
collapse to the first record of each Avro file.

If I set spark.shuffle.sort.bypassMergeThreshold to something greater than 200,
the data comes out correctly. I don't understand why or how.
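For reference, this is how I'm setting it, on the same SparkConf shown below
(a minimal sketch; the value 300 is just an illustration, and 200 is the
documented default for this property):

```java
import org.apache.spark.SparkConf;

// Sketch: raising the bypass-merge threshold above the job's task count.
// 300 is only an illustrative value; Spark's default is 200.
SparkConf conf = new SparkConf()
    .setAppName("AnalyticsJob")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.shuffle.sort.bypassMergeThreshold", "300");
```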

Here are the relevant pieces of code:
JavaSparkContext context = new JavaSparkContext(
        new SparkConf()
            .setAppName(AnalyticsJob.class.getSimpleName())
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer"));

context.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.input.dir.recursive", "true");
context.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.inputdir", job.inputDirectory);

JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
        context.newAPIHadoopRDD(
                context.hadoopConfiguration(),
                AvroKeyInputFormat.class,
                AvroKey.class,
                NullWritable.class)
            .keys())
        .map(event -> event.datum())
        .filter(event -> Optional.ofNullable(event.getStepEventKey()).isPresent())
        .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
        .groupByKey()
        .map(tuple -> tuple._1());

events.persist(StorageLevel.MEMORY_AND_DISK_2());

If I do a collect on events at this point, the data is jumbled and not what we
expect, so when we pass it on to the next job in our pipeline for aggregation,
the results come out wrong.

The downstream job maps to pairs again and stores the results in the database.
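One thing I've been wondering about (purely an assumption on my part, not
something I've confirmed): Hadoop input formats like AvroKeyInputFormat reuse a
single record object across reads, and above the bypassMergeThreshold Spark
switches to the sort-based shuffle path, which buffers deserialized objects in
memory rather than serializing each record straight to disk. If the events
aren't copied, every buffered reference could end up pointing at the same
mutated object. A minimal, Spark-free sketch of that hazard:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Mutable record standing in for the reused AvroKey/datum object.
    static class Record {
        int value;
    }

    // Buffer records the way an in-memory shuffle buffer might:
    // by reference, without making a defensive copy.
    static List<Record> bufferWithoutCopy(int[] inputs) {
        Record shared = new Record();      // one object reused by the "reader"
        List<Record> buffered = new ArrayList<>();
        for (int v : inputs) {
            shared.value = v;              // reader overwrites it in place
            buffered.add(shared);          // no copy: same reference each time
        }
        return buffered;                   // every element is the same object
    }

    public static void main(String[] args) {
        for (Record r : bufferWithoutCopy(new int[] {1, 2, 3})) {
            System.out.println(r.value);   // prints 3 three times
        }
    }
}
```

If that is what's happening, I'd guess the usual remedy is a defensive copy in
the map step, e.g. `AnalyticsEvent.newBuilder(event.datum()).build()` instead
of `event.datum()` (assuming AnalyticsEvent is an Avro-generated specific
record). Does that sound plausible?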

Thanks in advance for any help.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-data-incorrect-when-more-than-200-tasks-tp21710.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
