[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422202#comment-15422202 ]

Laxman commented on FLUME-1227:
-------------------------------

[~roshan_naik], we are planning to use this channel, but found that it does not persist its in-memory data on shutdown. FLUME-2396 has been filed for the same. IMHO, data loss in a channel with persistence may not be acceptable. I can work with you if you feel this should be fixed.

> Introduce some sort of SpillableChannel
> ---------------------------------------
>
>                 Key: FLUME-1227
>                 URL: https://issues.apache.org/jira/browse/FLUME-1227
>             Project: Flume
>          Issue Type: New Feature
>          Components: Channel
>    Affects Versions: v1.4.0
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Roshan Naik
>             Fix For: v1.5.0
>
>         Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, FLUME-1227.v6.patch, FLUME-1227.v7.patch, FLUME-1227.v8.patch, FLUME-1227.v9.patch, SpillableMemory Channel Design 2.pdf, SpillableMemory Channel Design.pdf
>
> I would like to introduce a new channel that would behave similarly to scribe (https://github.com/facebook/scribe). It would be something between the memory and file channels. Input events would be saved directly to memory (only) and served from there. In case the memory becomes full, we would spill the events to file.
> Let me describe the use case behind this request. We have plenty of frontend servers that generate events. We want to send all events to just a limited number of machines, from which we would send the data to HDFS (some sort of staging layer). The reason for this second layer is our need to decouple event aggregation and frontend code onto separate machines. Using the memory channel is fully sufficient, as we can survive the loss of some portion of the events. However, in order to sustain maintenance windows or networking issues, we would end up with a lot of memory assigned to those "staging" machines.
> The referenced scribe deals with this problem by implementing the following logic: events are saved in memory, similarly to our MemoryChannel. However, in case the memory gets full (because of maintenance, networking issues, ...), it spills data to disk, where it sits until everything starts working again.
> I would like to introduce a channel that implements similar logic. Its durability guarantees would be the same as MemoryChannel's: in case someone removes the power cord, this channel would lose data. Based on the discussion in FLUME-1201, I would propose to keep the implementation completely independent of any other channel's internal code.
> Jarcec

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
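[Editor's note] The channel proposed above shipped in Flume 1.5.0 as the Spillable Memory Channel. A minimal agent configuration might look like the sketch below; property names follow the Flume 1.5 user guide, while the agent/channel names, capacities, and paths are illustrative:

```properties
# Sketch of a SpillableMemoryChannel configuration (illustrative values).
agent.channels = c1
agent.channels.c1.type = SPILLABLEMEMORY
# Maximum events held in memory before overflowing to disk
agent.channels.c1.memoryCapacity = 10000
# Maximum events allowed in the on-disk overflow (0 disables spilling)
agent.channels.c1.overflowCapacity = 1000000
# The overflow reuses the file channel's on-disk layout
agent.channels.c1.checkpointDir = /var/lib/flume/checkpoint
agent.channels.c1.dataDirs = /var/lib/flume/data
```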
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916097#comment-13916097 ]

Hari Shreedharan commented on FLUME-1227:
-----------------------------------------

[~roshan_naik] - When we roll 1.5, jiras with no fix versions will be updated.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915146#comment-13915146 ]

Hari Shreedharan commented on FLUME-1227:
-----------------------------------------

+1. I am going to run tests and commit this one. Since this is being marked as experimental, I made a change in the user guide to clarify that it is not recommended for production use. I also made some minor indentation changes in SpillableMemoryChannel.java.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915184#comment-13915184 ]

ASF subversion and git services commented on FLUME-1227:
--------------------------------------------------------

Commit d5805c8598be4eec85de8973b4c98ecdd7ffe6d3 in flume's branch refs/heads/flume-1.5 from [~hshreedharan]
[ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=d5805c8 ]
FLUME-1227. Introduce Spillable Channel. (Roshan Naik via Hari Shreedharan)
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915183#comment-13915183 ]

ASF subversion and git services commented on FLUME-1227:
--------------------------------------------------------

Commit 6a50ec2ad33b8cbd057907c67030d855520c5f13 in flume's branch refs/heads/trunk from [~hshreedharan]
[ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=6a50ec2 ]
FLUME-1227. Introduce Spillable Channel. (Roshan Naik via Hari Shreedharan)
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915232#comment-13915232 ]

Roshan Naik commented on FLUME-1227:
------------------------------------

Should we set the 'fix version' to 1.5?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913186#comment-13913186 ]

Otis Gospodnetic commented on FLUME-1227:
-----------------------------------------

Was just about to write to the ML asking about this functionality. Looks like all known issues have been fixed, plus this is new functionality, so it should go in and get some real-world action, which we'd love to give it as soon as 1.5.0 is out! +10 for committing this. Any chance of this going in before 1.5.0 is cut?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913190#comment-13913190 ]

Thilo Seidel commented on FLUME-1227:
-------------------------------------

Good day, I am out of the office today. Until my return, your mail will neither be read nor automatically forwarded. Best regards, Thilo Seidel
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880852#comment-13880852 ]

Roshan Naik commented on FLUME-1227:
------------------------------------

[~hshreedharan], if there are no other comments, could you look into committing this?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872552#comment-13872552 ]

Hari Shreedharan commented on FLUME-1227:
-----------------------------------------

[~roshan_naik] - Is this ready for review (since you have not hit Submit Patch)?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857070#comment-13857070 ]

Brock Noland commented on FLUME-1227:
-------------------------------------

Thank you for addressing the feedback! I am OK with your reasoning regarding adding dual checkpointing to the example. I haven't looked at this code and review in detail. It looks like Hari has, so I think he'll have to make the call on when to commit. Thank you for your hard work, Roshan!
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853707#comment-13853707 ]

Roshan Naik commented on FLUME-1227:
------------------------------------

Thanks for the feedback, [~brocknoland]. I will incorporate your feedback and update the patch soon. WRT adding notes on file channel best practices to the Spillable Channel section, I am not too hot on that unless it specifically has to do with its coupling with the Spillable channel. In FLUME-2239 I recently made a note about multiple data dirs helping file channel perf. Also, the dual checkpoint feature is broken on Windows (FLUME-2224). Let me know if you feel otherwise.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851768#comment-13851768 ] Brock Noland commented on FLUME-1227:

Hey, I have not participated in the review till now, so sorry about this... but I just noticed the following items, which are mostly nits and improvements.

SpillableMemoryChannel
1. Static stuff should be at the top.
2. The constructor should be directly below the fields.
3. String constants should be static final fields with a javadoc description.
4. Stuff can be final:
{noformat}
private Object queueLock = new Object();
{noformat}

TestSpillableMemoryChannel
1. Take null has a commented-out assertion.
2. There are locations where we expect Exception that should be a specific type of exception.
3. Let's not use e.printStackTrace();
4. Places where we assert a boolean should have a message.
5. Many missing spaces, such as:
{noformat}
for (int i=0; i<count; ++i) {
{noformat}
and
{noformat}
nullsFound=count;
{noformat}

Docs
1. Please specify multiple data directories in the examples, and add a note that file channel performance will increase dramatically with multiple disks.
2. Add dual checkpoint to the examples, as that is a good practice.
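The class-layout nits above (static constants first with javadoc, final fields, constructor directly below the fields) can be sketched as follows. This is a hypothetical illustration of the suggested style, not the actual SpillableMemoryChannel code; all member names are invented.

```java
// Hypothetical layout sketch following the review suggestions: static
// constants at the top (with javadoc), final fields next, constructor
// directly below the fields. Names are illustrative only.
public class SpillableMemoryChannelSketch {

  /** Configuration key for the in-memory capacity (illustrative name). */
  private static final String MEMORY_CAPACITY_KEY = "memoryCapacity";

  /** Default in-memory capacity used when none is configured. */
  private static final int DEFAULT_MEMORY_CAPACITY = 10000;

  // Never reassigned, so it can (and should) be final.
  private final Object queueLock = new Object();

  private final int memoryCapacity;

  public SpillableMemoryChannelSketch(int memoryCapacity) {
    this.memoryCapacity = memoryCapacity;
  }

  public int getMemoryCapacity() {
    return memoryCapacity;
  }
}
```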
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850319#comment-13850319 ] Roshan Naik commented on FLUME-1227:

Hi [~hshreedharan], I have addressed most of your comments locally, but will need another day to address your comments on the incorrect counter test issue; it needs some thinking through on my part. Thanks for catching them.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849852#comment-13849852 ] Hari Shreedharan commented on FLUME-1227:

Hey [~roshan_naik] - Any updates here?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843403#comment-13843403 ] Hari Shreedharan commented on FLUME-1227:

Hi Roshan,

In the takePrimary and takeOverflow methods there is a Preconditions.checkArgument call where, as you mentioned in the takePrimary method comments, there is an int-Integer-String conversion in a hot path (this is handled with an if in the takePrimary method, but not in takeOverflow). Can you get rid of the Preconditions call and just do: if (...) { throw new IllegalStateException(...); }? For one, this is cleaner, since the if already checks for the issue and we avoid an unneeded method call.

Is this because rolling back the overflow txn will ensure that the event goes back into the file channel and you don't need to handle it?
{code}
if (!useOverflow) {
  takeList.offer(event); // takeList is thread-private, so no need to do this in a synchronized block
}
{code}
If that is the case, the counters are incorrect when the committed transaction is an overflow transaction, since this is how they are updated:
{code}
channelCounter.addToEventTakeSuccessCount(takeList.size());
{code}
Even this is not accurate:
{code}
if (takeList.size() > largestTakeTxSize)
  largestTakeTxSize = takeList.size();
{code}
There are also a couple of issues with regard to failed transactions when writing to the primary (granted, it is a queue and it should not fail, but if a lock acquire gets interrupted, it can still fail). The memQueueRemaining semaphore has already been updated before pushing the events to the queue (that is definitely the right thing to do), but if a queue.offer fails, memQueueRemaining is not updated. This might be an issue with the current channels too, and is sufficiently rare that we can revisit it later. Also, there is a possibility of partially successful transactions right now (if the queue inserts fail). That, I guess, is true for all channels right now, so we can live with it; just mentioning it to ensure we know it is a possibility and can revisit if needed.
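The if/throw guard suggested in the review above can be sketched as below. With Preconditions.checkArgument(boolean, String, Object...), the int message argument is boxed and a varargs array allocated on every call, even when the check passes; the explicit if pays the message-building cost only on failure. All names here are illustrative, not the actual channel code.

```java
// Minimal sketch of the plain if/throw hot-path guard suggested in the
// review, in place of Preconditions.checkArgument. Names are hypothetical.
public class TakeGuardSketch {
  private final int capacity;
  private int queueSize;

  public TakeGuardSketch(int capacity) {
    this.capacity = capacity;
  }

  public void put() {
    // No boxing and no message formatting unless the check actually fails.
    if (queueSize >= capacity) {
      throw new IllegalStateException(
          "queue full: size=" + queueSize + ", capacity=" + capacity);
    }
    ++queueSize;
  }

  public int size() {
    return queueSize;
  }
}
```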
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843421#comment-13843421 ] Hari Shreedharan commented on FLUME-1227:

Also, there are several lines over 80 characters. Can you make sure that you fix this too? For comments, please put the comment before the relevant line if it is expected to be long.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843442#comment-13843442 ] Hari Shreedharan commented on FLUME-1227:

The patch seems to be failing tests:
{code}
Picked up _JAVA_OPTIONS: -Djava.awt.headless=true
Running org.apache.flume.channel.TestSpillableMemoryChannel
Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 103.657 sec <<< FAILURE!
testTotalStoredSemaphore(org.apache.flume.channel.TestSpillableMemoryChannel)  Time elapsed: 2923 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<4500>
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.failNotEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:128)
	at org.junit.Assert.assertEquals(Assert.java:472)
	at org.junit.Assert.assertEquals(Assert.java:456)
	at org.apache.flume.channel.TestSpillableMemoryChannel.testTotalStoredSemaphore(TestSpillableMemoryChannel.java:735)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Results :

Failed tests:   testTotalStoredSemaphore(org.apache.flume.channel.TestSpillableMemoryChannel): expected:<0> but was:<4500>
{code}
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843787#comment-13843787 ] Roshan Naik commented on FLUME-1227:

- Will fix the 80-character length issue you noted.
- I will need to review the code more closely wrt your other comments related to txn correctness; let me get back to you on them.
- [~hshreedharan], could you please confirm that the test failure was noticed in patch v7?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843789#comment-13843789 ] Hari Shreedharan commented on FLUME-1227:

Yes, it was v7.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814292#comment-13814292 ] Roshan Naik commented on FLUME-1227:

[~hshreedharan], all the review comments should be addressed now. If there are no other concerns, could you commit this?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795589#comment-13795589 ] Hari Shreedharan commented on FLUME-1227:

[~roshan_naik] - Could you please update the patch on rb?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795614#comment-13795614 ] Roshan Naik commented on FLUME-1227:

[~hshreedharan] just updated it.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747758#comment-13747758 ] Roshan Naik commented on FLUME-1227:

[~hshreedharan] and others interested: could you take a stab at reviewing this code?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726702#comment-13726702 ] Roshan Naik commented on FLUME-1227: Appreciate your feedback Hari.

HARI: It looks like the channel can actually return fewer events than are available in the case where there are only n events in the primary queue and an (n+1)-th take would happen - since the events in a particular txn will always come from one queue. I think we should be able to pull events from the other store if it turns out to be required - else we expect the sink to come back and poll immediately - and also cause sink-side transactions to be smaller than they have to be - which can cause Avro/HDFS batch sizes to be smaller than configured, causing perf issues.

Yes, that is correct. The sink's transaction batch size would be smaller in that case. The case would only occur when the take transaction transitions between overflow and primary. The alternative, as you suggest, is to pull from both overflow and primary, but that opens up some fundamental problems similar to distributed transactions. Essentially the sink needs to have two transactions open (one each on overflow and primary) which need to be atomically committed/rolled back. Thoughts?

HARI: How the channel recovers from an overflow situation.

I have updated the design doc (section 2.1.2) to elaborate on this. The short version is: new incoming events will go into the primary if the sinks have drained older events from the primary, even if the overflow is not empty. Let me know if the description addresses your question sufficiently.
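The single-store-per-transaction behavior debated in this exchange can be illustrated with a small sketch of the drain-order idea: a queue of signed run lengths records which store each run of events went to, and a take transaction draws only from the store at the head of the queue. This is an illustrative sketch only, not Flume's actual SpillableMemoryChannel code; all names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch (not Flume's real code) of a drain-order queue: signed counts record,
// in arrival order, how many events went to the primary (+n) or overflow (-n).
// A take transaction draws only from the store the head run points at, which is
// why one transaction can return fewer events than are available overall.
public class DrainOrderSketch {
    private final Deque<Integer> drainOrder = new ArrayDeque<>();

    // Record a put of n events into the primary (memory) store.
    public void putPrimary(int n) {
        Integer tail = drainOrder.peekLast();
        if (tail != null && tail > 0) {
            drainOrder.addLast(drainOrder.pollLast() + n); // merge adjacent runs
        } else {
            drainOrder.addLast(n);
        }
    }

    // Record a spill of n events into the overflow (disk) store.
    public void putOverflow(int n) {
        Integer tail = drainOrder.peekLast();
        if (tail != null && tail < 0) {
            drainOrder.addLast(drainOrder.pollLast() - n); // merge adjacent runs
        } else {
            drainOrder.addLast(-n);
        }
    }

    // Size and source of the next take batch: positive => primary, negative =>
    // overflow. Capped at the requested batch size, but never crossing into the
    // other store within a single transaction.
    public int nextTakeBatch(int requested) {
        Integer head = drainOrder.peekFirst();
        if (head == null) return 0;
        int run = Math.abs(head);
        int taken = Math.min(run, requested);
        int remaining = run - taken;
        drainOrder.pollFirst();
        if (remaining > 0) drainOrder.addFirst(head > 0 ? remaining : -remaining);
        return head > 0 ? taken : -taken;
    }
}
```

The sketch also shows why pulling from both stores in one transaction would need two atomically coordinated sub-transactions, as the comment above notes.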
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721312#comment-13721312 ] Hari Shreedharan commented on FLUME-1227: - Hi Roshan, thanks for the updated design doc and patch. I looked at the design doc and this approach looks good. I like the fact that there are no dependencies (at least as mentioned in the doc) on the file channel's implicit behavior. I have one question though. The drain order queue seems to keep a count of how many events are written to which store each time a write happens (using the -ve and +ve numbers). It looks like the channel can actually return fewer events than are available in the case where there are only n events in the primary queue and an (n+1)-th take would happen - since the events in a particular txn will always come from one queue. I think we should be able to pull events from the other store if it turns out to be required - else we expect the sink to come back and poll immediately - and also cause sink-side transactions to be smaller than they have to be - which can cause Avro/HDFS batch sizes to be smaller than configured, causing perf issues. Also, I am not clear on how the channel recovers from an overflow situation. Assume that the primary has a capacity of n and we are currently overflowing. When do we decide to go back to the primary? Is it when all n from the primary have been removed, or do we not go back to it until restart? (Sorry, I didn't look at the code yet - this does not seem to have gotten a mention in the design doc.)
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628201#comment-13628201 ] Hari Shreedharan commented on FLUME-1227: - Thanks for your patience with this, Roshan. This approach seems fine. It is a good idea to explicitly do the instantiation inside the SC. You can go ahead with that for now, I guess. But here is some food for thought: the fundamental difference between this channel and the File Channel is the way the transactions get written out. Have you considered inheriting the File Channel, adding a 2nd data structure (your primary memory channel), and having the decision making happen in the transaction code? I am not sure how feasible it is or even how smart an idea it is, but it might be worth considering.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628260#comment-13628260 ] Roshan Naik commented on FLUME-1227: That's a very interesting suggestion. Thanks. I shall play with that idea also.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625573#comment-13625573 ] Roshan Naik commented on FLUME-1227: Hari, Juhani, if there are no additional concerns then I shall proceed with this approach. Settling on the general approach now will help us avoid pouring effort into an unacceptable direction. I shall wait for another day before proceeding.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626167#comment-13626167 ] Juhani Connolly commented on FLUME-1227: Seems like a reasonable compromise to me. I think any approach will have issues. Option 3 would probably be preferable to option 4, if it's doable.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621057#comment-13621057 ] Mike Percy commented on FLUME-1227: --- Roshan, that sounds good to me. Hari, Juhani, do you guys have any additional feedback on this proposal? Thanks, Mike
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614991#comment-13614991 ] Roshan Naik commented on FLUME-1227: I am not particularly wedded to the current approach. My first attempt, based on your suggestion, was to inline the config of the overflow channel in the SC itself. I discovered some [serious issues|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13540116&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13540116] with it, and so I pursued the alternative that had been discussed (but w/o consensus). The intent was to get the less contentious core logic working and return quickly to this phase of getting feedback on these shaky parts.
- Since you mention it, explicitly depending on FC (I assume by invoking 'new FileChannel()' inside SC) has not been discussed. It might be worth considering.
- Forking FC / creating yet another durable channel: this has been talked about, and the concern has been duplication of code (perhaps the most complex piece of Flume code). I think Juhani also noted the same. I too am concerned about that. If forked, each FC bug would have to be fixed in 2 places. FC seems to keep evolving, and the fork will likely become stale. I wonder if it makes sense to derive a class from FC and use it as overflow instead.
- Your unresolved code review question: we spoke about this when we met at the Flume meetup. On restart the overflow is drained completely first. It is addressed in the design doc under 'recovery from failures', but perhaps not very clearly.
- Yes, if SC does not have to guarantee strict ordering, then as long as the counts in the DOQ are correct, things will work fine. Ordering guarantees from the overflow are needed only if SC is required to provide an ordering guarantee. We already have a consensus that SC will not rely on any non-explicit FC guarantees.
- I totally agree with Hari and yourself on the transactionCapacity issue.
It makes total sense to expose channel size and capacity at the channel interface. I didn't do it in the first patch as I was afraid it might become a big point of contention. Perhaps a misplaced fear. MemC, FC and JdbcC may need minor tweaks for it. If there are no objections I can go ahead and make this change. I think now the only remaining open issue is how to deal with the overflow. Let me list the options that have been put forward so far, and some more:
1) User specifies in config which channel to use as overflow: the current approach, and it has given me all the grief that I anticipated :)
2) Fork FC / create yet another durable FC-like store, then embed it into SC. Some comments have been made on this already.
3) Explicitly instantiate FC directly inside SC.
4) Derive another class from FC and embed it into SC.
5) Based on Mike's comment about SinkProcessors... does it make sense to experiment with the notion of ChannelProcessors?
6) Any other ideas? Now would be THE time to speak.
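Option 3 from the list above (the SC instantiating its overflow directly) could be sketched roughly as follows. This is a self-contained illustration with stub types, not Flume's real Channel API; all class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of option 3: the SpillableChannel constructs and owns
// its overflow store internally, rather than being handed a separately
// configured channel by the user. Stub types only, not Flume's API.
public class SpillableSketch {
    interface Store {
        void put(String event);
        String take(); // null when empty
    }

    static class QueueStore implements Store {
        private final Queue<String> q = new ArrayDeque<>();
        private final int capacity;
        QueueStore(int capacity) { this.capacity = capacity; }
        public void put(String e) { q.add(e); }
        public String take() { return q.poll(); }
        boolean isFull() { return q.size() >= capacity; }
    }

    private final QueueStore primary;
    private final Store overflow;

    SpillableSketch(int memoryCapacity) {
        this.primary = new QueueStore(memoryCapacity);
        // Option 3: the overflow is instantiated here, inside the channel,
        // instead of being looked up from the agent's channel map.
        this.overflow = new QueueStore(Integer.MAX_VALUE);
    }

    void put(String event) {
        // Spill only when the primary is full, as in the design doc.
        if (primary.isFull()) overflow.put(event);
        else primary.put(event);
    }

    String take() {
        // Simplified: prefer the primary. The real design consults a
        // drain-order queue so events come out in arrival order.
        String e = primary.take();
        return e != null ? e : overflow.take();
    }
}
```

With direct instantiation the config subsystem never has to introduce two channels to each other, which is the separation-of-responsibilities concern raised earlier in the thread.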
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613418#comment-13613418 ] Mike Percy commented on FLUME-1227: --- Roshan, thanks a lot for this design documentation. Guys, based on my prior [reviewboard comment|https://reviews.apache.org/r/9544/], one big problem I have with this implementation is the way that the channels are allowed to know about each other. I am completely against this because it violates separation of responsibilities and encourages unmaintainable spaghetti dependencies between components. What's next, sinks? That is why we have SinkProcessors (so sinks don't have to know about each other). We simply cannot afford to open that Pandora's box. Let the SpillableChannel instantiate its own dependencies and govern their lifecycle. If explicitly depending on the file channel is a problem, then let's talk about ways to mitigate that... either forking a copy of the FC code into SC so that FC can evolve separately, or explicitly not relying on ordering in SC, if that is the issue. Therefore SC would not have ordering guarantees. Can the Drain Order Queue survive that situation? It makes me a little nervous that the DOQ even exists, to be honest... I don't really like it. It seems like a somewhat complex and brittle mechanism for achieving this spill functionality. But I would not block this patch because I'm not in love with the DOQ. And I think if the SC doesn't have to guarantee order, then as long as its counts are correct it should still work. Correct me if I'm wrong. If specific non-explicit guarantees of the FC are being relied on, then an alternative is to consider a different design that relies on different invariants than the DOQ does. I'm not necessarily advocating for that, I'm just throwing it out there as an option.
But I'd be happy with forking the FC and getting this checked in without a total redesign to make progress, if that addresses others' concerns. My other as-yet unresolved item of code review feedback involved what happens when the agent is stopped and then restarted while the channel has events in both the primary and secondary channels. Can this please be addressed as well? Additionally, I agree with Hari on the use of transactionCapacity as a poor substitute for a reservation amount on the underlying channels. We need a better way, and if exposing channel size and capacity via an interface will help, then I'm all for it. Regards, Mike
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611899#comment-13611899 ] Roshan Naik commented on FLUME-1227: - I concur that unspecified guarantees should not be depended upon. I can drop that assumption from the tests.
- I think it's very important to not continue to leave the guarantees unspecified. But that's for another Jira.
- WRT deferring the decision to commit() time: let me revisit that issue.
*Instantiation and config*: For discussion, I would like to treat instantiation (newing up the object) separately from life cycle (start/stop), since an existing instance may get reused during reconfiguration. The overflow does not need to be instantiated or configured before SC! Just like sources and sinks, channels can be instantiated and configured independently in any order. Only start/stop needs to be co-ordinated between the two. Also, we need to ensure that SC is not able to get a reference to the overflow if the overflow had configuration errors. All components (sinks/sources/channels) get introduced to each other after they are correctly configured. There is already a step to introduce configured sinks and sources to their channels. I have extended that step to introduce channels to each other. The current implementation is a bit permissive and could be tightened up so that SC is limited to obtaining a handle only to its overflow (not other channels).
*Life cycle*: Hari, correct me if you think it's not the case, but I think the current design is in tune with your desire that the SC owns the lifecycle (start/stop) of the overflow. The config subsystem merely instantiates, configures and introduces the two channels to each other. Thereafter it disowns the lifecycle of the overflow and lets the SC manage it. It retains ownership of SC's lifecycle, however. This is nice because we don't have to replicate solutions to some of the config-related aspects in SC.
We don not have to worry about the order in which channels are instantiated and configured, and at the same time gain control over the order in which the start/stop is called on the SC and its overflow. *Scribe*: Juhani, I think spilling policy can we definitely tweaked. Right now I spill into overflow only when primary is full. I like the idea that we can take a cue from the fact that takes() have begun to fail and start spilling early to minimize data loss. There is a throughput concern that I have with Scribe's operating mode where it switches exclusively to using either memory or disk. In SC's design we do not need to wait for the overflow to completely drain before resuming the use of the faster primary. I'll look more into scribe and see what we can leverage. - The fsync experiment is something i would like to defer and resolve other open items. It does not look like a blocker and more of a perf tuning thing. does that sound reasonable ? Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. 
Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. Based on the discussion
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13610111#comment-13610111 ] Juhani Connolly commented on FLUME-1227: I would personally prefer seeing a dependence on existing channels than another implementation of something like the file channel and something like the memory channel. The code-base is already getting pretty big, and the interfaces are fixed. The spillable channel shouldn't even know or care about what type the main/sub channel are, just feed them data. While it might not be the most optimal solution performance-wise, I think the cost would be small and it would give us less code to maintain overall. Either approach certainly has its merits. Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. 
Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608700#comment-13608700 ] Roshan Naik commented on FLUME-1227: Thanks Hari. 1) WRT the concern on not depending on another channel, i went down this path since it looked like there was some consensus when i started. What alternative design do you have in mind ? 2) WRT change in memory/file channel breaking the Spillable channel: Could you expand a bit ? I am not familiar with replay order issue and how it can impact. I dont think there is any intrinsic assumption being made wrt to any specific channel's behavior. Just to be doubly sure, i made sure not to rely on a single type of overflow channel in all the tests. The only material dependency (as far as I can tell) that Spillable Channel has on the overflow is the interface level guarantee that is expected from all channels: that order is maintained in case of single source/sink. Do you see any other assumptions/dependencies hiding there ? 3) WRT reserving capacity on both channels. If you mean that each txn should not reserve capacity on both channels. I agree. And the current implementation does not do that. Or were you by any chance referring to the issue of upfront reservation (at put() time) versus commit() time ? 4) WRT to testing with fsyncs removed, i have not pursued it since i felt that would be compromising the durability guarantees. Do you think its useful to do that ? 5) WRT we should make the configuration change. Can you elaborate ? I am not certain which change specifically you are referring to. Or are you referring to the whole config approach ? 6) WRT lifecycle management and dependencies : After configuration, any channel that is found to be not connected with a source/sink is automatically discarded from the list of Life cycle system managed components. Consequently the Spillable Channel becomes the sole life cycle manager of the overflow channel. 
Otherwise, yes there would be havoc. Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609360#comment-13609360 ] Hari Shreedharan commented on FLUME-1227: - {quote} 1) WRT the concern on not depending on another channel, i went down this path since it looked like there was some consensus when i started. What alternative design do you have in mind ? 2) WRT change in memory/file channel breaking the Spillable channel: Could you expand a bit ? I am not familiar with replay order issue and how it can impact. I dont think there is any intrinsic assumption being made wrt to any specific channel's behavior. Just to be doubly sure, i made sure not to rely on a single type of overflow channel in all the tests. The only material dependency (as far as I can tell) that Spillable Channel has on the overflow is the interface level guarantee that is expected from all channels: that order is maintained in case of single source/sink. Do you see any other assumptions/dependencies hiding there ? {quote} I am sorry, I was not part of the initial discussions - so I was not aware of the consensus aspect. What I am saying is that being dependent on another channel creates an undesired strong coupling between this channel and the other channels. An if there are unit tests in this channel which can break if one of the other channels' behavior is changed, then it is not something that is acceptable. If you look at all our other components, none of them have a dependence on each other (except the RPCClients - that is because the sinks are just glorified RPCClients). The reason I would not agree with even the single source/sink replay order is that our interfaces do not really enforce this. This is not really even enforced anywhere in the documentation either. The FileChannel did not even conform to that single source/sink replay order until FLUME-1432. 
In fact, conforming to that order even in FLUME-1432 was a side-effect of fixing a race condition, and not specifically because it was meant to be handled. At some point, if it is decided this can change again to some other order (maybe a thread based ordering, or or an order in which events in a transaction will all get written out together on commit, rather than getting written out on put and fsynced on commit), then if this channels' tests break, the onus will be on the contributor who submitted the file channel change to fix it - which I do not agree with. In summary, I am ok with depending on other channels. What I am not ok with is depending on the behavior of those channels, which are not explicitly guaranteed through interfaces (or even documentation). bq. 3) WRT reserving capacity on both channels. If you mean that each txn should not reserve capacity on both channels. I agree. And the current implementation does not do that. Or were you by any chance referring to the issue of upfront reservation (at put() time) versus commit() time ? I am talking about put v/s commit time. In most cases, transaction capacity is often configured to be much higher than the the max expected in most cases. I would suggest doing a full implementation where there is a transaction outside, and a backing store inside. Once the transaction is about to get committed, then decide where the events go. (It is going to be tricky to do this and avoid doing all the writes at once - the File Channel fsyncs on commit, but writes to OS buffers on every write - so it is possible some data is flushed to disk before explicit fsyncs). This is not a blocker anyway, we can work on it later as well. bq. 4) WRT to testing with fsyncs removed, i have not pursued it since i felt that would be compromising the durability guarantees. Do you think its useful to do that ? 
I was wondering whether simply adding a config param to change the fsyncs (fsync all files before checkpoint in parallel or something) to optional will give comparable performance to what is being proposed in this jira. I have a feeling it might, since fsyncs are the most expensive part of the file channel, and removing the fsyncs just writes to the in-memory OS buffer and the fsyncs will be taken care of in the background. {quote} 5) WRT we should make the configuration change. Can you elaborate ? I am not certain which change specifically you are referring to. Or are you referring to the whole config approach ? 6) WRT lifecycle management and dependencies : After configuration, any channel that is found to be not connected with a source/sink is automatically discarded from the list of Life cycle system managed components. Consequently the Spillable Channel becomes the sole life cycle manager of the overflow channel. Otherwise, yes there would be havoc. {quote} I just think we should not allow one component to pull a reference to another component in the system. This explicitly breaks the
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609824#comment-13609824 ] Juhani Connolly commented on FLUME-1227: I had a look at the design doc and comments so just thought I'd chip in. So long as we're only depending on the Channel interface for behaviors, I think we're good, I believe this was the intention in an earlier proposal of this feature. I agree with Hari about ordering. It's not a guarantee we enforce in flume, and while nice, I think that it over-complicates things. As to lifecycle management, I don't necessary feel that having a channel own it's sub-channels is a particularly good precedent. I think it would be preferable that we allow the lifecycle manager to return interfaces rather than having components creating other components explicitly. Configuration would have to have some grasp of dependencies though... Sub-channels would need to be instantiated before the owner As to the fsync thing: definitely should be an option. Separate issue though. Making it possible to disable it would be great. Since this depends on in memory data, durability really shouldn't be an issue. If you have data in memory, it doesn't really matter if it's in the memory channel or in the OS file buffer One thing you may want to consider is the approach taken by scribed(which has other problems, but the buffer store implementation is very nice): - Default to using the main channel - Upon a next hop failure(roll back of take transaction in our case), switch to a buffering mode. All data is sent to the buffer channel until recovery. One may want to move the contents of the primary channel to the buffer if maintaining ordering is an objective. This could also reduce loss of data. - During buffering mode, puts and takes go to the buffer channel, until it has been drained. Once it has been drained, return to streaming mode where operations are performed against the primary channel. 
Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609911#comment-13609911 ] Hari Shreedharan commented on FLUME-1227: - Hi Juhani, Thanks for you comments. I agree with most of what you have mentioned. {quote} As to lifecycle management, I don't necessary feel that having a channel own it's sub-channels is a particularly good precedent. I think it would be preferable that we allow the lifecycle manager to return interfaces rather than having components creating other components explicitly. Configuration would have to have some grasp of dependencies though... Sub-channels would need to be instantiated before the owner {quote} I agree with your last statement. Configuration will also need to detect cycles etc so that you don't have a cycle of interdependent components. I don't particularly like the idea of passing references of existing channels to others to use as sub-channels - something that I don't like, but won't block since there seems to have been some consensus regarding this earlier. I frankly think 2 channels within the same one is overkill. I think this channel can be easily implemented by using a mmap-ed file which is never specifically fsync-ed. This might cause some page faults etc., but the page cache management is usually smart enough to not cause this to affect performance a whole lot - this implementation is likely to be faster too (in fact, this is very similar to the File Channel checkpoint class). Using this as a cyclic buffer would probably be as good, and gives the same guarantees as the memory channel (which is what we are targeting in this jira, I suppose?). Also, I like the implementation you have mentioned above, though this can be quite tricky to get right. 
Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13606137#comment-13606137 ] Hari Shreedharan commented on FLUME-1227: - Roshan, Sorry it took me this long to get to this one. I reviewed the design document and I have a couple of relatively major concerns: #. This channel implicitly depends on the behavior of current channels - the File Channel and Memory Channel. As one of the people who maintain the file channel, I strongly feel this is not the correct thing to do. It is possible that behavior of the File Channel or the Memory Channel could change (This is not without precedent. In FLUME-1437, we did change the replay order). At that point, a change in the behavior of the File Channel or Memory Channel would break unit/integration tests for this channel - which could delay a commit. #. I don't think we should make the configuration change. The idea of the Lifecycle manager is to handle all the components and make them independent of each other. Dependencies on other components managed by the Lifecycle system is a bad idea. This also sets a bad precedent. This can lead to patches that make component inter-dependent and depend on the other component being a particular one (example a source using this hook to figure out if it is operating on Memory Channel or File Channel). I believe the current design is a bit more complex than it needs to be - due to the handling of more than one transaction. Also reserving transaction capacity on both channels is a bad indicator of where the txn should go. In my experience, people do set the transaction capacity to a value much higher than the average transaction. Also, have you tested this against a slightly modified File Channel with all of the fsyncs removed (or commented out)? I'd be interested in seeing the difference in performance at that point. 
Also, see FLUME-1423 where Denny removed the fsyncs for performance (the performance of the channel has improved even more since then though). Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. 
Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13603141#comment-13603141 ] Roshan Naik commented on FLUME-1227: Looking to revive attention on this one. Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. 
Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588425#comment-13588425 ] Brock Noland commented on FLUME-1227: - Same as Mike. [~hshreedharan] any time for a review? Introduce some sort of SpillableChannel --- Key: FLUME-1227 URL: https://issues.apache.org/jira/browse/FLUME-1227 Project: Flume Issue Type: New Feature Components: Channel Reporter: Jarek Jarcec Cecho Assignee: Roshan Naik Attachments: 1227.patch.1 I would like to introduce new channel that would behave similarly as scribe (https://github.com/facebook/scribe). It would be something between memory and file channel. Input events would be saved directly to the memory (only) and would be served from there. In case that the memory would be full, we would outsource the events to file. Let me describe the use case behind this request. We have plenty of frontend servers that are generating events. We want to send all events to just limited number of machines from where we would send the data to HDFS (some sort of staging layer). Reason for this second layer is our need to decouple event aggregation and front end code to separate machines. Using memory channel is fully sufficient as we can survive lost of some portion of the events. However in order to sustain maintenance windows or networking issues we would have to end up with a lot of memory assigned to those staging machines. Referenced scribe is dealing with this problem by implementing following logic - events are saved in memory similarly as our MemoryChannel. However in case that the memory gets full (because of maintenance, networking issues, ...) it will spill data to disk where they will be sitting until everything start working again. I would like to introduce channel that would implement similar logic. It's durability guarantees would be same as MemoryChannel - in case that someone would remove power cord, this channel would lose data. 
Based on the discussion in FLUME-1201, I would propose to have the implementation completely independent on any other channel internal code. Jarcec -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588624#comment-13588624 ] Hari Shreedharan commented on FLUME-1227: --- I can take a quick look later today, though I can't promise when I can do a full review.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540116#comment-13540116 ] Roshan Naik commented on FLUME-1227: Seeking input. The current configuration system does not look conducive to chaining channels. Here are the config techniques that have been discussed previously:

1) Out-of-line:

  agent1.channels = channel1 channel2
  agent1.channels.channel1.type = SPILLABLE
  agent1.channels.channel1.overflow = channel2
  agent1.channels.channel2.type = FILE
  agent1.channels.channel2.checkpointDir = /path1
  ...

The problems here are:
- At the time channel1 is configured, channel2 may not have been instantiated yet, so it is not possible to latch on to an instance of channel2. It may be better to defer obtaining a reference to the overflow channel until start time.
- There is no mechanism to get a reference to one channel from another (in this case, at start time).

2) Inline (as suggested by Mike):

  agent1.channels = channel1
  agent1.channels.channel1.type = SPILLABLE
  agent1.channels.channel1.overflowChannel.type = FILE
  agent1.channels.channel1.overflowChannel.checkpointDir = /path1
  agent1.channels.channel1.overflowChannel.dataDirs = /path2
  ...

The issue here is that the instantiation and configuration of the overflow channel would now have to reside inside SpillableChannel::configure(), and that method is not a very suitable place for doing such things.

3) Hard coding: hard-code the file channel to be the overflow channel. This allows the file channel to be easily instantiated and configured; the downside is that it still duplicates the channel instantiation/config logic from AbstractConfigurationProvider.loadChannels().

Any thoughts?
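The main wrinkle with the inline technique (2) is pulling the overflow channel's settings out of the parent channel's namespace inside configure(). A minimal sketch of that extraction follows; the class and helper names are hypothetical, and a plain java.util.Map stands in for Flume's Context, so this is illustrative only, not the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: how SpillableChannel.configure() might collect the
// "overflowChannel.*" properties so a nested channel can be built from them.
public class InlineOverflowConfigSketch {

    // Collect every property under the given prefix, with the prefix
    // stripped, similar in spirit to Context.getSubProperties(prefix).
    static Map<String, String> subProperties(Map<String, String> props, String prefix) {
        Map<String, String> sub = new HashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                sub.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return sub;
    }

    public static void main(String[] args) {
        // Properties as they would appear under agent1.channels.channel1.*
        Map<String, String> channelProps = new HashMap<>();
        channelProps.put("type", "SPILLABLE");
        channelProps.put("overflowChannel.type", "FILE");
        channelProps.put("overflowChannel.checkpointDir", "/path1");
        channelProps.put("overflowChannel.dataDirs", "/path2");

        Map<String, String> overflow = subProperties(channelProps, "overflowChannel.");
        // configure() would then instantiate a channel of this type and hand
        // it the stripped-down properties.
        System.out.println(overflow.get("type"));          // FILE
        System.out.println(overflow.get("checkpointDir")); // /path1
    }
}
```

The extraction itself is trivial; the concern raised above is that instantiating and configuring a whole second channel from inside another channel's configure() duplicates provider logic and happens at an awkward point in the lifecycle.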
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510888#comment-13510888 ] Mike Percy commented on FLUME-1227: --- Hey Roshan, sounds good to me, except I'd recommend trying this out with a brand new channel that delegates to a memory channel, in order to minimize the risk of destabilizing what is a very solid and important core component.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510988#comment-13510988 ] Roshan Naik commented on FLUME-1227: You mean we conceptually create a new MemChannel++, where the ++ part is basically the overflow ability?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511024#comment-13511024 ] Mike Percy commented on FLUME-1227: --- Right. Or we could call it SpillableChannel, I guess. :) I don't have a strong opinion on the name, personally.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506841#comment-13506841 ] Roshan Naik commented on FLUME-1227: Hi Mike, yes, you are right. I think it is a downside of that algorithm; I realized the same after posting that comment.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504893#comment-13504893 ] Roshan Naik commented on FLUME-1227: Thanks for those valuable thoughts, Mike. I have described an algorithm for puts/takes [here|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13493481&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13493481]. It should solve the ordering problem, handle transactions correctly, and maximize throughput.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505021#comment-13505021 ] Brock Noland commented on FLUME-1227: --- If we move forward with this proposal, I think it'd be great to see a design document.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504291#comment-13504291 ] Roshan Naik commented on FLUME-1227: Continuing the discussion... I spent some time studying the discussions in the jiras related to solving the problem of spilling over (and/or failover). I think failover and spillover should not be conflated into the same problem, even though it may be possible to address both in the same solution. There is a consensus that the problem is worth addressing. The concerns hover around these dimensions:

1) Complexity of implementation and configuration, and potentially [enhancements|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13430529&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13430529] to existing interfaces
2) Complexity of testing
3) Ensuring transaction guarantees are preserved, and how weak or strong those guarantees are
4) Defining the durability level (durable or not) of the final solution - this is simple IMHO
5) Efficiency of the solution (batching requests when spilling over)
6) Flexibility

The solutions discussed so far, along with their concerns:

1) Failover sink processor - has issues with retaining transaction guarantees ([Reference|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13235705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13235705])
2) Mechanisms for composing existing channels ([1201|https://issues.apache.org/jira/browse/FLUME-1201] and [my proposal|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13492828&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492828]) - flexible, but has complexities with regard to testing ([mixed opinions here|https://issues.apache.org/jira/browse/FLUME-1201?focusedCommentId=13282018&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13282018]) and with the implementation determining durability ([See|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13235705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13235705])
3) Spillable channel - limited functionality, but easier to test and to pin down transaction+durability semantics for

My thoughts... The concerns about mechanisms for composing channels are largely centered around complexity, and I feel some of them are unfounded. Testing a composition mechanism is not as complex as has been feared, for reasons stated [here|https://issues.apache.org/jira/browse/FLUME-1201?focusedCommentId=13282018&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13282018]. In a pluggable system (like the rest of Flume) we rely on guarantees from the interface itself; there is no need to test every combination of channels, just as it does not make sense to test all combinations of sink/channel/source/interceptors/sink-processors in Flume. Implementation of a composition mechanism would also be simpler: it would be focused only on the issues involved in stitching channels together, not on actually providing a robust backing store.

A spillover channel (Mem + File) seems a little too specialized - for instance, it does not provide durability for users who need it. It would be nice to allow the primary channel to be on a fast, smaller durable store (like SSDs) and overflow into another, slower durable store (like hard disk/jdbc). The following general strategy for compounding channels seems worth discussing:

  agent1.channels.compoundChannel.type = compound
  agent1.channels.compoundChannel.1 = memChannel1
  agent1.channels.compoundChannel.2 = fileChannel1
  agent1.channels.compoundChannel.3 = jdbcChannel1
  agent1.channels.compoundChannel.1.overflowBatchSize = 100   # batch size when spilling into fileChannel1
  agent1.channels.compoundChannel.2.overflowBatchSize = 1000  # batch size when spilling into jdbcChannel1

Any thoughts?
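The spillover behaviour the thread keeps returning to can be sketched as a toy, in-memory model: puts land in a bounded memory queue and spill to a secondary store only once that queue is full. Everything here is invented for illustration (the class, the naive drain-memory-first policy), and it deliberately ignores transactions and batching, so it is a conceptual sketch rather than the actual SpillableChannel design:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a spillable channel: a bounded in-memory queue with a
// secondary queue standing in for the durable overflow store.
public class SpilloverSketch {
    private final Deque<String> memory = new ArrayDeque<>();
    private final Deque<String> overflow = new ArrayDeque<>(); // stand-in for file/jdbc channel
    private final int capacity;

    public SpilloverSketch(int capacity) { this.capacity = capacity; }

    public void put(String event) {
        if (memory.size() < capacity) {
            memory.addLast(event);   // fast path: memory only
        } else {
            overflow.addLast(event); // memory full: spill to secondary store
        }
    }

    // Drain memory first, then overflow. Note that once spilling has
    // happened, this simple policy can reorder events relative to arrival,
    // which is exactly the ordering problem the puts/takes algorithm
    // referenced earlier in the thread has to address.
    public String take() {
        if (!memory.isEmpty()) return memory.pollFirst();
        return overflow.pollFirst();
    }

    public int memorySize() { return memory.size(); }
    public int overflowSize() { return overflow.size(); }

    public static void main(String[] args) {
        SpilloverSketch ch = new SpilloverSketch(2);
        ch.put("e1");
        ch.put("e2");
        ch.put("e3"); // third put spills
        System.out.println(ch.memorySize() + " in memory, " + ch.overflowSize() + " spilled");
    }
}
```

A real implementation must additionally make put/take transactional across both stores and batch the spilled writes (the role the overflowBatchSize settings above would play), which is where most of the complexity discussed in this thread lives.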
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495737#comment-13495737 ] Roshan Naik commented on FLUME-1227: Looks like this jira is up for grabs? If there is agreement that my proposal is a good way forward, I would like to pick it up. Thoughts?
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495778#comment-13495778 ] Roshan Naik commented on FLUME-1227: Actually, I think this proposal, if acceptable, would have to be a different jira, since the current jira is about introducing a new channel.
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495785#comment-13495785 ] Hari Shreedharan commented on FLUME-1227: --- Roshan - that might be a good thing to do, but there was a discussion about a compound channel several months ago, and I believe the consensus was that it would be too complex to write and even more complex to test. But feel free to file a jira - I am sure there will be a healthy discussion.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495798#comment-13495798 ] Bernardo de Seabra commented on FLUME-1227:

I like this approach (quite popular with Scribe), but my only concern is around performance. You would get an unexpected/unpredictable performance impact from disk IO, which could be (in our case, it would be) impacting your application if Flume and the app share the same disk. It's a tradeoff.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493481#comment-13493481 ] Roshan Naik commented on FLUME-1227:

I agree Scribe's policy is suboptimal. It is better to prioritize the parent channel whenever it has spare capacity and still maintain order. To achieve this I have a simple algorithm in mind. The parent channel maintains a 'drain order' queue of signed numbers which indicates, at any time, the order in which the items in it and its overflow channel should be drained. For instance, the numbers [3, -2, 6, -1] in that queue indicate the following drain order:

- drain 3 from self
- then drain 2 from overflow
- then drain 6 from self
- then drain 1 from overflow

The channel's put() will update its drain order queue (DOQ) as follows:

    if (I have capacity) {
        add event to my own queue
        if last element in DOQ is +ve then increment it
        else push +1 to DOQ
    } else {
        call put() on overflow
        if last element in DOQ is -ve then decrement it
        else push -1 to DOQ
    }

I think the take() should be obvious. Obviously, corner cases like an empty self and an empty overflow need to be handled appropriately, but this is just capturing the idea.
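The drain-order bookkeeping described above can be sketched in a few lines of Java. This is a minimal, standalone illustration of the idea only - the class and field names are hypothetical, plain in-memory deques stand in for both the memory and file channels, and none of this is Flume code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the signed "drain order queue" (DOQ) idea:
// a positive entry means "drain that many from the primary queue",
// a negative entry means "drain that many from the overflow queue".
class SpillableQueueSketch<E> {
    private final int primaryCapacity;
    private final Deque<E> primary = new ArrayDeque<>();   // stands in for the memory channel
    private final Deque<E> overflow = new ArrayDeque<>();  // stands in for the file channel
    private final Deque<Integer> drainOrder = new ArrayDeque<>();

    SpillableQueueSketch(int primaryCapacity) {
        this.primaryCapacity = primaryCapacity;
    }

    void put(E event) {
        if (primary.size() < primaryCapacity) {
            primary.addLast(event);
            // extend the trailing +ve run, or start a new one
            if (!drainOrder.isEmpty() && drainOrder.peekLast() > 0) {
                drainOrder.addLast(drainOrder.pollLast() + 1);
            } else {
                drainOrder.addLast(1);
            }
        } else {
            overflow.addLast(event);
            // extend the trailing -ve run, or start a new one
            if (!drainOrder.isEmpty() && drainOrder.peekLast() < 0) {
                drainOrder.addLast(drainOrder.pollLast() - 1);
            } else {
                drainOrder.addLast(-1);
            }
        }
    }

    E take() {
        if (drainOrder.isEmpty()) {
            return null; // both queues empty
        }
        int head = drainOrder.pollFirst();
        E event;
        if (head > 0) {
            event = primary.pollFirst();
            if (head > 1) drainOrder.addFirst(head - 1); // part of the run remains
        } else {
            event = overflow.pollFirst();
            if (head < -1) drainOrder.addFirst(head + 1);
        }
        return event;
    }
}
```

Feeding five events through a capacity-2 instance leaves a drain order of [+2, -3], and take() then returns the events in their original arrival order, which is the property the algorithm is after.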
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493500#comment-13493500 ] Roshan Naik commented on FLUME-1227:

Apologies for the email storm created by multiple edits to my previous comment.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492828#comment-13492828 ] Roshan Naik commented on FLUME-1227:

I don't see this option discussed, but it seems interesting (and IMO avoids some of the issues in sink-triggered spooling as discussed in FLUME-1045). Basically, instead of adding another Spillable channel which is logically a composite of memory and file channels, we could add a config directive to Memory Channel such as:

    agent1.channels.memChannel1.overflow = fileChannel1

That is, there would be a preconfigured file channel (or jdbc or some custom channel) into which the memory channel would simply spill over events when capacity has been reached. There should be no other sources or sinks tied to an overflow channel. Ideally, any channel should be able to use another channel for overflow.
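In a full agent configuration, the proposed directive might look like the sketch below. Note that the `overflow` property is hypothetical - it is the proposal being discussed here, not an actual Memory Channel setting - and the channel names and paths are illustrative:

```properties
agent1.channels = memChannel1 fileChannel1

agent1.channels.memChannel1.type = memory
agent1.channels.memChannel1.capacity = 10000
# hypothetical directive: spill to fileChannel1 once capacity is reached
agent1.channels.memChannel1.overflow = fileChannel1

# overflow channel: no sources or sinks attach to it directly
agent1.channels.fileChannel1.type = file
agent1.channels.fileChannel1.checkpointDir = /var/flume/checkpoint
agent1.channels.fileChannel1.dataDirs = /var/flume/data
```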
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492876#comment-13492876 ] Juhani Connolly commented on FLUME-1227:

Interesting suggestion... When would you suggest that the overflow channel's contents be read, and by what component?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492923#comment-13492923 ] Roshan Naik commented on FLUME-1227:

The parent channel's put()/take() will be the source/sink for its overflow channel. For the special case of just supporting it in the memory channel, I think it could easily employ whatever policy the SpillableChannel would have used. For the more general case of making this a cross-cutting feature available to all channels, with the ability to chain, I would conjecture it may be possible to use the same policy at each level of the chain. So this policy could be pushed into the common base class for channels.
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491154#comment-13491154 ] Rahul Ravindran commented on FLUME-1227:

Is there a timeline on when this new channel would be out?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491209#comment-13491209 ] Mike Percy commented on FLUME-1227:

I don't know of anyone actively working on this...
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431609#comment-13431609 ] Juhani Connolly commented on FLUME-1227:

Since the channel is not aware of the state of sinks, I think Jarek's proposed method sounds good. In another place, it was pointed out that we cannot just change the interface, as it will break people's custom components. However, I think you can get away with a method similar to what Configurable uses now: add a CapacityPollable interface or something, and check whether the channel implements it, polling if it does. If it doesn't, you will just have to rely on catching exceptions as an indicator of problems.
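The opt-in interface pattern described in the comment above could be sketched as follows. The interface name CapacityPollable comes from the comment; everything else (the method name, the router class) is hypothetical illustration, not Flume API:

```java
// Hypothetical opt-in interface: channels that implement it can be polled
// for spare capacity instead of forcing callers to catch exceptions.
interface CapacityPollable {
    /** Number of additional events the channel can currently accept. */
    int remainingCapacity();
}

final class OverflowRouter {
    // Spill to overflow only when the channel reports itself full. Legacy
    // channels that don't implement the interface are left alone here; the
    // caller would fall back to catching exceptions from put().
    static boolean shouldSpill(Object channel) {
        if (channel instanceof CapacityPollable) {
            return ((CapacityPollable) channel).remainingCapacity() == 0;
        }
        return false;
    }
}
```

This mirrors how Flume handles Configurable: an instanceof check at wiring time, so existing custom components compile and run unchanged.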
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430470#comment-13430470 ] Seetharam Venkatesh commented on FLUME-1227:

Does this mean there is no effort going into FLUME-1045?
[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel
[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428531#comment-13428531 ] Denny Ye commented on FLUME-1227:

That's great and useful when Flume cannot reach HDFS or another destination. It's also the same concept as Scribe's 'primary store' and 'secondary store'. Looking forward to any implementation.