Hi, I have two different parts in my system:
1. A batch application that every X minutes runs SQL queries across several tables containing millions of rows, composes an entity from the results, and sends those entities to Kafka.
2. A streaming application that processes the data from Kafka.

The whole system is working, but I want to improve the performance of the batch part: if I have 100 million entities, I currently send them all to Kafka in a single foreach loop, one after another, which makes no sense for the downstream streaming application. I would rather send them to Kafka in chunks of, say, 10 million events each.

I have a query like this:

    select ... from table1
    left outer join table2 on ...
    left outer join table3 on ...
    left outer join table4 on ...

My goal is to do *pagination* on table1: take 10 million rows into a separate RDD, do the joins, send the result to Kafka, then take the next 10 million and do the same, and so on. All the tables are stored in Parquet format on HDFS.

I am thinking of using the toLocalIterator method, along these lines, but I have doubts about memory usage and parallelism, and I am sure there is a better way to do it:

    rdd.toLocalIterator.grouped(10000000).foreach { seq =>
      // Re-distribute the current chunk and process it on the cluster.
      val chunk: RDD[(String, Int)] = sc.parallelize(seq)
      // Do the joins and send the chunk to Kafka.
    }

What do you think?

Regards,

--
Gaspar Muñoz
@gmunozsoria <http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
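P.S. On the memory doubt: toLocalIterator pulls the RDD's partitions to the driver one at a time, so grouped(10000000) would buffer up to 10 million rows in driver memory before parallelize ships them back to the cluster. One alternative that would avoid that round-trip is to index table 1 once with zipWithIndex and filter out one index range per chunk, so each page is selected and joined entirely on the cluster. A minimal sketch under assumed names (table1 stands for the RDD read from the table 1 Parquet files, sendChunkToKafka for the existing join-and-produce logic):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Sketch only: "table1" and "sendChunkToKafka" are placeholder
    // names for the Parquet-backed RDD and the existing join + Kafka
    // producer logic.
    def processInChunks(table1: RDD[Row], chunkSize: Long)
                       (sendChunkToKafka: RDD[Row] => Unit): Unit = {
      // Index table 1 once; zipWithIndex assigns each row a stable
      // position without collecting anything to the driver.
      val indexed = table1.zipWithIndex().cache()
      val total = indexed.count()
      val numChunks = ((total + chunkSize - 1) / chunkSize).toInt

      for (i <- 0 until numChunks) {
        val lo = i * chunkSize
        val hi = lo + chunkSize
        // Select the current "page" of table 1; the filter and the
        // downstream joins both stay distributed.
        val page = indexed
          .filter { case (_, idx) => idx >= lo && idx < hi }
          .map { case (row, _) => row }
        sendChunkToKafka(page)
      }
      indexed.unpersist()
    }

Each chunk re-scans the cached, indexed RDD, so this trades repeated filters for keeping everything distributed; whether that beats the toLocalIterator round-trip probably depends on the cluster and the data sizes.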