[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986220#comment-13986220 ]
Tathagata Das commented on SPARK-1645:
--------------------------------------

This makes sense from the integration point of view. Though I wonder, from the POV of Flume's deployment configuration, does it make things more complex? For example, if someone already has a Flume system set up, in the current situation the configuration change to add a new sink seems standard and easy. In the proposed model, however, since Flume's data-pushing node has to run a sink, how much more complicated does this configuration process get?

> Improve Spark Streaming compatibility with Flume
> ------------------------------------------------
>
>                 Key: SPARK-1645
>                 URL: https://issues.apache.org/jira/browse/SPARK-1645
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, or else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this.
> * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case.
> * Data loss when the driver goes down. This is true for any streaming ingest, not just Flume; I will file a separate JIRA for this and we should work on it there. This is a longer-term project and requires considerable development work.
> I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on JIRA, so I can assign the JIRA to myself.)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
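The pull model proposed in the first bullet can be sketched in miniature. This is purely an illustration of the protocol, not the actual Flume sink or Spark receiver API: the class and method names below (`BufferingSink`, `poll`, `ack`, `nack`) are hypothetical. The key idea is that the sink keeps a polled batch as "pending" until the receiver acknowledges it, so a receiver that restarts on a different node loses nothing; delaying `ack()` until after the batch is stored also addresses the premature-ack problem in the second bullet.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a polling sink. Events stay owned by the sink
// until the receiver explicitly acknowledges the batch it pulled.
class BufferingSink {
    private final ArrayDeque<String> buffer = new ArrayDeque<>();
    private List<String> pending = new ArrayList<>();

    void put(String event) { buffer.addLast(event); }

    // Receiver polls for up to n events; the sink marks them "pending"
    // rather than discarding them, in case the receiver dies mid-batch.
    List<String> poll(int n) {
        pending = new ArrayList<>();
        for (int i = 0; i < n && !buffer.isEmpty(); i++) {
            pending.add(buffer.pollFirst());
        }
        return pending;
    }

    // Called only after the driver has durably stored the batch.
    void ack() { pending = new ArrayList<>(); }

    // Receiver failed before storing the batch: requeue it in order.
    void nack() {
        for (int i = pending.size() - 1; i >= 0; i--) {
            buffer.addFirst(pending.get(i));
        }
        pending = new ArrayList<>();
    }

    int size() { return buffer.size() + pending.size(); }
}
```

With this shape, a crashed poll followed by `nack()` leaves the sink holding all events, and the replacement receiver simply polls again; the Flume-side configuration cost the comment worries about is essentially one extra sink definition per data-pushing node.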