Hi! I've been using Spark for the last few months and it is awesome. I'm pretty new to this topic, so don't be too harsh on me. Recently I've been doing some simple tests with Spark Streaming for log processing, and I'm considering different ETL input solutions such as Flume or PDI+Kafka.
My use case will be:

1. Collect logs from different applications running on different physical servers.
2. Transform and pre-process those logs.
3. Process all the log data with Spark Streaming.

I've got a question about processing data where it is located. Ideally I'd like Spark Streaming (standalone, YARN or Mesos) to handle the decision of processing data wherever it lives. I know I can set up whatever Flume workflow (agents --> collectors) I want and then upload the aggregated data to HDFS, where I guess the system will pick the best worker to operate on every split of the data. Am I right? Will the Spark Streaming + Flume integration (without sinking into HDFS) provide this kind of data-locality behavior? Any tips to point me in the right direction?
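To make the setup concrete, this is roughly the kind of Flume agent I had in mind for step 1: a sketch of an agent tailing an application log and pushing events over an Avro sink to the host/port where a Spark Streaming Flume receiver would listen (the push-based integration). The hostname, port, log path and agent name are just placeholders I made up, not anything from a real deployment:

```
# Hypothetical Flume agent: tail an app log and push events to Spark Streaming
agent1.sources = app-logs
agent1.channels = mem-channel
agent1.sinks = spark-sink

# Source: tail the application log (path is a placeholder)
agent1.sources.app-logs.type = exec
agent1.sources.app-logs.command = tail -F /var/log/myapp/app.log
agent1.sources.app-logs.channels = mem-channel

# In-memory channel buffering events between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Avro sink pointing at the host/port where the Spark receiver listens
# (hostname and port are assumptions for this sketch)
agent1.sinks.spark-sink.type = avro
agent1.sinks.spark-sink.hostname = spark-worker-1
agent1.sinks.spark-sink.port = 9988
agent1.sinks.spark-sink.channel = mem-channel
```

On the Spark side I understand the stream would then be created with FlumeUtils pointed at that same host/port, but I'm not sure whether that receiver placement gives any locality-aware scheduling, which is really the heart of my question.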