I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the Kafka
direct stream approach. I am running into performance problems: my
processing time is greater than my window size. Changing the window size and adding
cores and executor memory does not change performance. I am having a lot of
trouble identifying the problem by looking at the metrics provided for streaming
apps in the Spark application web UI.

I think my performance problem has to do with writing the data to S3.

My app receives very complicated JSON. My program is simple: it sorts the
data into a small set of sets and writes each set as a separate S3 object.
The mini-batch data has at most 300 events, so I do not think shuffle is an
issue.

DataFrame rawDF = sqlContext.read().json(jsonRDD).cache();

// ... explode tagCol ...

DataFrame rulesDF = activityDF.select(tagCol).distinct();
Row[] rows = rulesDF.select(tagCol).collect();
List<String> tags = new ArrayList<String>(100);
for (Row row : rows) {
    Object tag = row.get(0);
    tags.add(tag.toString());
}



I think the for loop below is where the bottleneck is. Is write() async?



If not, is there an easy way to vectorize/parallelize this for loop, or do I
have to create the threads myself?



Is creating threads in Spark a bad idea?





for (String tag : tags) {
    DataFrame saveDF = activityDF.filter(activityDF.col(tagCol).equalTo(tag));
    // I do not think count() is an issue; performance is about 34 ms
    if (saveDF.count() >= 1) {
        String dirPath = "s3n://myBucket" + File.separator + date + File.separator
                + tag + File.separator + milliSeconds;
        saveDF.write().json(dirPath);
    }
}
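
For reference, this is roughly what I have in mind if I do have to create the
threads myself: a small driver-side thread pool that submits the per-tag
filter + write as concurrent Spark jobs. This is only an untested sketch; the
writeTagsConcurrently wrapper and the pool size of 4 are made up, and it assumes
the same activityDF, tagCol, date, and milliSeconds variables as above.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.spark.sql.DataFrame;

// Sketch only: run several per-tag writes at once from driver-side threads.
void writeTagsConcurrently(final DataFrame activityDF, final String tagCol,
                           List<String> tags, final String date,
                           final long milliSeconds) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4); // up to 4 concurrent jobs
    List<Future<Void>> futures = new ArrayList<Future<Void>>();
    for (final String tag : tags) {
        futures.add(pool.submit(new Callable<Void>() {
            @Override
            public Void call() {
                // Same filter + write as in the loop above, just issued from a worker thread
                DataFrame saveDF = activityDF.filter(activityDF.col(tagCol).equalTo(tag));
                String dirPath = "s3n://myBucket" + File.separator + date + File.separator
                        + tag + File.separator + milliSeconds;
                saveDF.write().json(dirPath);
                return null;
            }
        }));
    }
    for (Future<Void> f : futures) {
        f.get(); // wait for all writes and surface any failure
    }
    pool.shutdown();
}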



Any suggestions would be greatly appreciated.



Andy