[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2016-08-15 Thread Laxman (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422202#comment-15422202
 ] 

Laxman commented on FLUME-1227:
---

[~roshan_naik], we are planning to use this channel, but we found that it does 
not persist in-memory data on shutdown. FLUME-2396 has already been filed for 
this. IMHO, data loss in a channel that offers persistence may not be 
acceptable. I can work with you on this if you feel it should be fixed.

> Introduce some sort of SpillableChannel
> ---
>
> Key: FLUME-1227
> URL: https://issues.apache.org/jira/browse/FLUME-1227
> Project: Flume
> Issue Type: New Feature
> Components: Channel
> Affects Versions: v1.4.0
> Reporter: Jarek Jarcec Cecho
> Assignee: Roshan Naik
> Fix For: v1.5.0
>
> Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
> FLUME-1227.v6.patch, FLUME-1227.v7.patch, FLUME-1227.v8.patch, 
> FLUME-1227.v9.patch, SpillableMemory Channel Design 2.pdf, SpillableMemory 
> Channel Design.pdf
>
>
> I would like to introduce a new channel that would behave similarly to scribe 
> (https://github.com/facebook/scribe). It would be something between the memory 
> and file channels. Input events would be saved directly to memory (only) 
> and would be served from there. If the memory became full, we 
> would spill the events to a file.
> Let me describe the use case behind this request. We have plenty of frontend 
> servers that generate events. We want to send all events to just a 
> limited number of machines, from which we would send the data to HDFS (some 
> sort of staging layer). The reason for this second layer is our need to move 
> event aggregation off the frontend machines onto separate ones. Using the memory 
> channel is fully sufficient, as we can survive the loss of some portion of the 
> events. However, in order to ride out maintenance windows or networking issues, 
> we would end up having to assign a lot of memory to those "staging" 
> machines. The referenced "scribe" deals with this problem by implementing the 
> following logic: events are saved in memory, much as in our MemoryChannel. 
> However, if the memory gets full (because of maintenance, networking 
> issues, ...), it spills data to disk, where it stays until 
> everything starts working again.
> I would like to introduce a channel that would implement similar logic. Its 
> durability guarantees would be the same as MemoryChannel's: if someone 
> pulled the power cord, this channel would lose data. Based on the 
> discussion in FLUME-1201, I would propose making the implementation 
> completely independent of any other channel's internal code.
> Jarcec
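
The memory-first, spill-on-full behaviour described in the issue can be sketched roughly as follows. This is only an illustration, not the code from any of the attached patches: the class name `SpillSketch`, its methods, and the use of a second in-memory deque as a stand-in for the on-disk overflow are all assumptions made for the example. To keep FIFO order, new events go to memory only while nothing is waiting in the overflow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a memory-first queue with overflow; not Flume code. */
class SpillSketch {
    private final int memoryCapacity;
    private final Deque<String> memory = new ArrayDeque<>();
    // Stand-in for the on-disk overflow; a real channel would write to files.
    private final Deque<String> overflow = new ArrayDeque<>();

    SpillSketch(int memoryCapacity) {
        this.memoryCapacity = memoryCapacity;
    }

    /** Stores in memory while there is room and nothing is spilled; otherwise spills. */
    synchronized void put(String event) {
        if (overflow.isEmpty() && memory.size() < memoryCapacity) {
            memory.addLast(event);
        } else {
            overflow.addLast(event);
        }
    }

    /** Serves the oldest event: memory first, then the overflow; null if both are empty. */
    synchronized String take() {
        if (!memory.isEmpty()) {
            return memory.pollFirst();
        }
        return overflow.pollFirst();
    }

    /** Number of events currently held in the overflow store. */
    synchronized int spilledCount() {
        return overflow.size();
    }
}
```

Because events in memory are always older than events in the overflow, draining memory first preserves arrival order. As the description says, nothing here survives a power loss.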



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-28 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916097#comment-13916097
 ] 

Hari Shreedharan commented on FLUME-1227:
-

[~roshan_naik] - When we roll 1.5, jiras with no fix versions will be updated.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-27 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915146#comment-13915146
 ] 

Hari Shreedharan commented on FLUME-1227:
-

+1. I am going to run tests and commit this one. Since this is being marked as 
experimental, I made a change in the user guide to clarify it is not 
recommended for production use. 

I also made some minor indentation changes in SpillableMemoryChannel.java



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915184#comment-13915184
 ] 

ASF subversion and git services commented on FLUME-1227:


Commit d5805c8598be4eec85de8973b4c98ecdd7ffe6d3 in flume's branch 
refs/heads/flume-1.5 from [~hshreedharan]
[ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=d5805c8 ]

FLUME-1227. Introduce Spillable Channel.

(Roshan Naik via Hari Shreedharan)




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915183#comment-13915183
 ] 

ASF subversion and git services commented on FLUME-1227:


Commit 6a50ec2ad33b8cbd057907c67030d855520c5f13 in flume's branch 
refs/heads/trunk from [~hshreedharan]
[ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=6a50ec2 ]

FLUME-1227. Introduce Spillable Channel.

(Roshan Naik via Hari Shreedharan)




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-27 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915232#comment-13915232
 ] 

Roshan Naik commented on FLUME-1227:


Should we set the 'fix version' to 1.5?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-26 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913186#comment-13913186
 ] 

Otis Gospodnetic commented on FLUME-1227:
-

Was just about to write to the ML asking about this functionality.  Looks like 
all known issues have been fixed, plus this is new functionality, so it should 
go in and get some real-world action, which we'd love to give it as soon as 
1.5.0 is out!

+10 for committing this.  Any chances of this going in before 1.5.0 is cut?




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-02-26 Thread Thilo Seidel (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913190#comment-13913190
 ] 

Thilo Seidel commented on FLUME-1227:
-

Good day,
I am not in the office today. Your mail will be neither read nor 
automatically forwarded until my return.
Best regards,
Thilo Seidel




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-01-24 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880852#comment-13880852
 ] 

Roshan Naik commented on FLUME-1227:


[~hshreedharan], if there are no other comments, could you look into committing 
this?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2014-01-15 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872552#comment-13872552
 ] 

Hari Shreedharan commented on FLUME-1227:
-

[~roshan_naik] - Is this ready for review (since you have not hit Submit 
Patch)?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-26 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857070#comment-13857070
 ] 

Brock Noland commented on FLUME-1227:
-

Thank you for addressing the feedback!  I am OK with your reasoning regarding 
adding dual checkpointing to the example. I haven't looked at this code and 
review in detail. It looks like Hari has, so I think he'll have to make the 
call of when to commit.

Thank you for your hard work Roshan!



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-19 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853707#comment-13853707
 ] 

Roshan Naik commented on FLUME-1227:


Thanks for the feedback, [~brocknoland]. 
I will incorporate your feedback and update the patch soon. 

Regarding adding notes on file channel best practices to the Spillable Channel 
section, I am not too keen on that unless they specifically concern its 
coupling with the Spillable channel. In FLUME-2239 I recently made a note about 
multiple data dirs helping file channel performance. Also, the dual checkpoint 
feature is broken on Windows (FLUME-2224). Let me know if you feel otherwise.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, FLUME-1227.v8.patch, 
 SpillableMemory Channel Design 2.pdf, SpillableMemory Channel Design.pdf


 I would like to introduce a new channel that would behave similarly to Scribe 
 (https://github.com/facebook/scribe). It would be something between the memory 
 and file channels. Input events would be saved directly to memory (only) 
 and would be served from there. If the memory filled up, we 
 would spill the events to a file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just a 
 limited number of machines, from where we would send the data to HDFS (some 
 sort of staging layer). The reason for this second layer is our need to decouple 
 event aggregation and frontend code onto separate machines. Using the memory 
 channel is fully sufficient, as we can survive the loss of some portion of the 
 events. However, in order to sustain maintenance windows or networking issues, 
 we would have to end up with a lot of memory assigned to those staging 
 machines. Scribe deals with this problem by implementing the 
 following logic: events are saved in memory, similarly to our MemoryChannel. 
 However, if the memory gets full (because of maintenance, networking 
 issues, ...), it spills data to disk, where they sit until 
 everything starts working again.
 I would like to introduce a channel that would implement similar logic. Its 
 durability guarantees would be the same as MemoryChannel's: if someone 
 removed the power cord, this channel would lose data. Based on the 
 discussion in FLUME-1201, I would propose to have the implementation 
 completely independent of any other channel's internal code.
 Jarcec
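
A minimal sketch of the spill-over behaviour described in the request, assuming a bounded in-memory queue and a second queue standing in for the on-disk store. The class name `SpillSketch`, its methods, and the use of an in-memory deque in place of a file are illustrative assumptions, not the design in the attached patches.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; not the SpillableMemoryChannel from the patch. */
class SpillSketch {
  private final int memoryCapacity;
  private final Deque<String> memory = new ArrayDeque<>();
  // Stand-in for the on-disk overflow; a real channel would write to a file.
  private final Deque<String> overflow = new ArrayDeque<>();

  SpillSketch(int memoryCapacity) {
    this.memoryCapacity = memoryCapacity;
  }

  /** Stores in memory while there is room, otherwise spills to overflow. */
  synchronized void put(String event) {
    // Once anything has spilled, keep spilling so FIFO order is preserved.
    if (overflow.isEmpty() && memory.size() < memoryCapacity) {
      memory.addLast(event);
    } else {
      overflow.addLast(event);
    }
  }

  /** Serves the oldest event: memory first, then overflow; null if empty. */
  synchronized String take() {
    String event = memory.pollFirst();
    return (event != null) ? event : overflow.pollFirst();
  }

  /** Number of events currently held in the overflow stand-in. */
  synchronized int spilledCount() {
    return overflow.size();
  }
}
```

In this sketch, puts keep going to the overflow until it has drained completely, which is one simple way to keep events in arrival order; other orderings are possible.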



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-18 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851768#comment-13851768
 ] 

Brock Noland commented on FLUME-1227:
-

Hey, I have not participated in the review until now, so sorry about this, but I 
just noticed the following items, which are mostly nits and improvements.

SpillableMemoryChannel
1. Static stuff should be at the top
2. Constructor should be directly below fields
3. String constants should be static final fields with javadoc description
4. Stuff can be final:
{noformat}
private Object queueLock = new Object();
{noformat}
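
A short sketch of what the four layout points might look like when applied together. The class name, the constant names, and the default value are hypothetical and chosen only for illustration; they are not taken from the patch.

```java
/** Hypothetical class illustrating the layout suggestions; not the patch code. */
class ChannelLayoutSketch {
  /** Configuration key for the in-memory queue size (name assumed for illustration). */
  static final String MEMORY_CAPACITY = "memoryCapacity";

  /** Default used when MEMORY_CAPACITY is not configured (value assumed). */
  static final int DEFAULT_MEMORY_CAPACITY = 10000;

  // The lock reference never changes after construction, so it can be final.
  private final Object queueLock = new Object();

  private int queued;

  // Constructor sits directly below the fields.
  ChannelLayoutSketch() {
    this.queued = 0;
  }

  /** Increments the queued count under the lock and returns the new value. */
  int increment() {
    synchronized (queueLock) {
      return ++queued;
    }
  }
}
```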

TestSpillableMemoryChannel
1. Take null has a commented out assertion
2. There are locations where we expect Exception that should be a specific 
type of exception.
3. Let's not use e.printStackTrace();
4. Places we assert boolean should have a message
5. Many missing spaces such as:
{noformat}
for (int i=0; i<count; ++i) {
{noformat}
and
{noformat}
nullsFound=count;
{noformat}
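
The test-style points above could be sketched roughly as follows. This is plain Java so that it runs without a test framework; `expectThrows`, `assertTrue`, and `countNulls` are hypothetical helpers written for this illustration, not part of the patch or of JUnit.

```java
/** Hypothetical test helpers; plain Java so it runs without a test framework. */
class TestStyleSketch {

  /** Runs the action and returns the thrown exception if it has the expected type. */
  static <T extends Throwable> T expectThrows(Class<T> expected, Runnable action) {
    try {
      action.run();
    } catch (Throwable t) {
      if (expected.isInstance(t)) {
        return expected.cast(t);
      }
      // Fail with context instead of calling e.printStackTrace().
      throw new AssertionError("Expected " + expected.getName() + " but got " + t, t);
    }
    throw new AssertionError("Expected " + expected.getName() + " but nothing was thrown");
  }

  /** Boolean assertion that always carries a message. */
  static void assertTrue(String message, boolean condition) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  /** Counts null entries, written with spaces around operators. */
  static int countNulls(Object[] items) {
    int nullsFound = 0;
    for (int i = 0; i < items.length; ++i) {
      if (items[i] == null) {
        nullsFound += 1;
      }
    }
    return nullsFound;
  }
}
```

In a JUnit 4 test, the same intent is usually expressed with `@Test(expected = SomeException.class)` and the `assertTrue(String, boolean)` overload.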

Docs

1. Please specify multiple data directories in the examples and add a note
that file channel performance will increase dramatically with multiple disks.

2. Add dual checkpoint to the examples as that is a good practice.
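
As a rough illustration of both documentation points, a channel configuration might look like the fragment below. The agent name `a1`, the channel name `c1`, the capacities, and all paths are placeholders. The property names follow the Flume user guide for the file channel and the spillable memory channel; whether the patch under review accepts them in exactly this form is an assumption.

```properties
# Hypothetical agent "a1" and channel "c1"; all paths are placeholders.
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
# Multiple data directories, ideally one per physical disk.
a1.channels.c1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data
# Dual checkpoints need a backup directory separate from checkpointDir.
a1.channels.c1.useDualCheckpoints = true
a1.channels.c1.backupCheckpointDir = /mnt/disk2/flume/checkpoint-backup
```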

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, FLUME-1227.v8.patch, 
 SpillableMemory Channel Design 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-17 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850319#comment-13850319
 ] 

Roshan Naik commented on FLUME-1227:


Hi [~hshreedharan], I have addressed most of your comments locally, but I will 
need another day to address your comments on the incorrect counter and test issues. 
They need some thinking through on my part. Thanks for catching them.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-16 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849852#comment-13849852
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Hey [~roshan_naik] - Any updates here?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-09 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843403#comment-13843403
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Hi Roshan,

In the takePrimary and takeOverflow methods, there is a 
Preconditions.checkArgument call where, as you mentioned in the takePrimary 
method comments, there is an int-Integer-String conversion in a hot path 
(this is handled with an if in the takePrimary method, but not in takeOverflow). 
Can you get rid of the Preconditions call, and just do:

if (...) {
  throw new IllegalStateException(..);
}

This for one is cleaner, since the if already checks for the issue and we can 
avoid an unneeded method call.
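
To make the hot-path concern concrete, here is a small sketch under stated assumptions. `checkState` below is a hand-written stand-in for a Preconditions-style call, not Guava's API, and the class, method, and message names are hypothetical. The point it shows is that a message passed as an argument is built on every call, while an explicit `if` builds it only on failure.

```java
/** Hypothetical sketch of the hot-path check; names are illustrative. */
class HotPathCheckSketch {
  // Counts how often the error message string is built.
  static int messagesBuilt = 0;

  static String describe(int available) {
    messagesBuilt++;
    return "Queue is empty, available=" + available;
  }

  /** Stand-in for a Preconditions-style call: the caller evaluates the message argument. */
  static void checkState(boolean ok, String message) {
    if (!ok) {
      throw new IllegalStateException(message);
    }
  }

  /** Eager: builds the message on every call, even when the check passes. */
  static void takeWithPrecondition(int available) {
    checkState(available > 0, describe(available));
  }

  /** Lazy: builds the message only when the check fails. */
  static void takeWithIf(int available) {
    if (available <= 0) {
      throw new IllegalStateException(describe(available));
    }
  }
}
```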

Is this because rolling back the overflow txn will ensure that the event goes 
back into the file channel and you don't need to handle it?

{code}
 if (!useOverflow) {
  // takeList is thread-private, so no need to do this in a synchronized block
  takeList.offer(event);
}
{code}

If that is the case, the counters are incorrect when the transaction committed 
is an overflow transaction, since this is how they are updated:
{code}
channelCounter.addToEventTakeSuccessCount(takeList.size());
{code}

Even this is not accurate:
{code}
  if (takeList.size() > largestTakeTxSize)
largestTakeTxSize = takeList.size();
{code}
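
One possible way to keep the counters correct for overflow transactions is to count takes separately from the rollback list. The sketch below assumes that approach; the class and its fields are hypothetical, and the actual fix in a later patch may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical transaction sketch; field and method names are illustrative. */
class TakeCountSketch {
  // Only primary (in-memory) takes are remembered for rollback.
  private final Deque<String> takeList = new ArrayDeque<>();
  // Counts every take, whichever store served it.
  private int takeCount = 0;
  private long takeSuccessCounter = 0;
  private int largestTakeTxSize = 0;

  void recordTake(String event, boolean useOverflow) {
    if (!useOverflow) {
      takeList.offer(event);
    }
    takeCount++;
  }

  void commit() {
    // Use takeCount, not takeList.size(), so overflow takes are included.
    takeSuccessCounter += takeCount;
    if (takeCount > largestTakeTxSize) {
      largestTakeTxSize = takeCount;
    }
    takeList.clear();
    takeCount = 0;
  }

  long takeSuccessCounter() {
    return takeSuccessCounter;
  }

  int largestTakeTxSize() {
    return largestTakeTxSize;
  }
}
```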


There are also a couple of issues with regard to failed transactions when writing 
to primary (granted, it is a queue and it should not fail, but if a lock acquire 
gets interrupted, it can still fail). The memQueueRemaining semaphore has 
already been updated before pushing the events to the queue (that is definitely 
the right thing to do), but if a queue.offer fails, memQueueRemaining is not 
updated to account for the failure. This might be an issue with the current 
channels too, and is sufficiently rare that we can revisit it later.

Also, there is a possibility of partially successful transactions right now (if 
the queue inserts fail). I guess that is true for all channels right now, so I 
guess we can live with it. I am just mentioning it to ensure that we know it is a 
possibility and can revisit it if needed.
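
The two points above about failed queue inserts could be handled along the following lines. This is only a sketch: the class is hypothetical, `memQueueRemaining` merely borrows the name used in the review, and the queue is deliberately allowed to be smaller than the permit count so that a failed `offer` can be demonstrated.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch; not the channel's actual commit path. */
class PermitRollbackSketch {
  private final BlockingQueue<String> queue;
  private final Semaphore memQueueRemaining;

  PermitRollbackSketch(int queueSize, int permits) {
    this.queue = new ArrayBlockingQueue<>(queueSize);
    this.memQueueRemaining = new Semaphore(permits);
  }

  /** Returns the number of events inserted; releases permits for the rest. */
  int commitPuts(List<String> events) {
    if (!memQueueRemaining.tryAcquire(events.size())) {
      return 0;  // not enough room; nothing was reserved
    }
    int inserted = 0;
    try {
      for (String event : events) {
        if (!queue.offer(event)) {
          break;  // insert failed; stop here
        }
        inserted++;
      }
    } finally {
      // Hand back permits reserved for events that never reached the queue.
      memQueueRemaining.release(events.size() - inserted);
    }
    return inserted;
  }

  int availablePermits() {
    return memQueueRemaining.availablePermits();
  }
}
```

The return value makes a partially successful insert visible to the caller; the sketch does not undo the events that were already queued.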


 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-09 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843421#comment-13843421
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Also, there are several lines > 80 characters. Can you make sure that you fix 
this too? For comments, please put the comments before the relevant line if 
they are expected to be long.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-09 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843442#comment-13843442
 ] 

Hari Shreedharan commented on FLUME-1227:
-

The patch seems to be failing tests :
{code}
---
Picked up _JAVA_OPTIONS: -Djava.awt.headless=true
Running org.apache.flume.channel.TestSpillableMemoryChannel
Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 103.657 sec <<< FAILURE!
testTotalStoredSemaphore(org.apache.flume.channel.TestSpillableMemoryChannel)  Time elapsed: 2923 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<4500>
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.junit.Assert.assertEquals(Assert.java:456)
at 
org.apache.flume.channel.TestSpillableMemoryChannel.testTotalStoredSemaphore(TestSpillableMemoryChannel.java:735)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
at org.junit.rules.RunRules.evaluate(RunRules.java:18)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at 
org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at 
org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)


Results :

Failed tests:   
testTotalStoredSemaphore(org.apache.flume.channel.TestSpillableMemoryChannel): 
expected:<0> but was:<4500>

{code}

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-09 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843787#comment-13843787
 ] 

Roshan Naik commented on FLUME-1227:


- I will fix the 80-character line length issue you noted.
- I will need to review the code more closely with respect to your other comments 
related to Txn correctness. Let me get back to you on them.
- [~hshreedharan], could you please confirm that the test failure was noticed 
in patch v7?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-12-09 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843789#comment-13843789
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Yes, it was v7. 

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, FLUME-1227.v7.patch, SpillableMemory Channel Design 
 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-11-05 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814292#comment-13814292
 ] 

Roshan Naik commented on FLUME-1227:


[~hshreedharan], all the review comments should be addressed now. If there are 
no other concerns, could you commit this?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, FLUME-1227.v5.patch, 
 FLUME-1227.v6.patch, SpillableMemory Channel Design 2.pdf, SpillableMemory 
 Channel Design.pdf





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-10-15 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795589#comment-13795589
 ] 

Hari Shreedharan commented on FLUME-1227:
-

[~roshan_naik] - Could you please update the patch on rb?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, SpillableMemory 
 Channel Design 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-10-15 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795614#comment-13795614
 ] 

Roshan Naik commented on FLUME-1227:


[~hshreedharan] just updated it.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, SpillableMemory 
 Channel Design 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-08-22 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747758#comment-13747758
 ] 

Roshan Naik commented on FLUME-1227:


[hshreedharan], and others interested: could you take a stab at reviewing 
this code?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, SpillableMemory 
 Channel Design 2.pdf, SpillableMemory Channel Design.pdf



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-08-01 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726702#comment-13726702
 ] 

Roshan Naik commented on FLUME-1227:


Appreciate your feedback, Hari.


HARI> It looks like channel can actually return fewer events than total 
available in the case where there are only n events in the primary queue and 
an n+1-th take would happen - since the events in a particular txn will 
always come from one queue. I think we should be able to pull events from the 
other store if it turns out to be required - else we expect the sink to come 
back and poll immediately - and also cause sink side transactions to be smaller 
than they have to be - which can cause Avro/HDFS batch sizes to be smaller than 
configured causing perf issues.


Yes, that is correct. The sink's transaction batch size would be smaller in 
that case. It would only occur when a take transaction transitions between 
overflow and primary. 
The alternative, as you suggest, is to pull from both overflow and primary, 
but that opens up some fundamental problems similar to those of distributed 
transactions. Essentially, the sink would need to have two transactions open 
(one each on overflow and primary), which would need to be atomically 
committed or rolled back. Thoughts?


HARI> How the channel recovers from an overflow situation.

I have updated the design doc (section 2.1.2) to elaborate on this. The short 
version is:

New incoming events will go into primary if the sinks have drained older 
events from the primary, even if overflow is not empty.

Let me know if the description addresses your question sufficiently.
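
The placement rule in the short version above can be sketched roughly as 
follows. This is only an illustration under assumptions: the class name 
PlacementSketch, the counters, and the fixed primary capacity are 
hypothetical and are not taken from the patch or the design doc.

```java
/**
 * Hypothetical sketch of the put-side placement rule: an incoming event goes
 * to primary whenever primary has room, regardless of whether overflow still
 * holds older events. Names and counters are illustrative only.
 */
final class PlacementSketch {
    enum Store { PRIMARY, OVERFLOW }

    private final int primaryCapacity;
    private int primarySize;
    private int overflowSize;

    PlacementSketch(int primaryCapacity) {
        this.primaryCapacity = primaryCapacity;
    }

    /** Chooses the store for one incoming event and records the placement. */
    Store put() {
        if (primarySize < primaryCapacity) {
            primarySize++;
            return Store.PRIMARY;
        }
        overflowSize++;
        return Store.OVERFLOW;
    }

    /** Simulates a sink draining up to n events from primary. */
    void drainPrimary(int n) {
        primarySize = Math.max(0, primarySize - n);
    }

    int overflowSize() {
        return overflowSize;
    }
}
```

With a primary capacity of 2, the third put spills to overflow; once a sink 
drains one event from primary, the next put lands in primary again even 
though overflow is still non-empty.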

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, SpillableMemory 
 Channel Design 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-07-26 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721312#comment-13721312
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Hi Roshan,

Thanks for the updated design doc and patch. I looked at the design doc and 
this approach looks good. I like the fact that there are no dependencies (at 
least as mentioned in the doc) on the file channel's implicit behavior. I have 
one question though. The drain order queue seems to keep a count of how many 
events are written to which store each time a write happens (using the -ve and 
+ve numbers). It looks like channel can actually return fewer events than total 
available in the case where there are only n events in the primary queue and 
an n+1-th take would happen - since the events in a particular txn will 
always come from one queue. I think we should be able to pull events from the 
other store if it turns out to be required - else we expect the sink to come 
back and poll immediately - and also cause sink side transactions to be smaller 
than they have to be - which can cause Avro/HDFS batch sizes to be smaller than 
configured causing perf issues. 

Also, I am not clear on how the channel recovers from an overflow situation. 
Assume that the primary has capacity of n and we are currently overflowing. 
When do we decide to go back to the primary? Is it when all n from the 
primary have been removed, or we don't go back to it until restart (sorry I 
didn't look at the code yet - this does not seem to have gotten a mention in 
the design doc).
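
To make the first question concrete, here is a rough sketch of a drain order 
queue that uses signed counts as described above. It is an assumption-laden 
illustration, not the code from the patch: the class name DrainOrderSketch and 
all method names are hypothetical, and events themselves are not stored, only 
counts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical drain order queue. Positive entries count events written to
 * the primary (memory) store; negative entries count events written to the
 * overflow (disk) store. Entries are kept in write order.
 */
final class DrainOrderSketch {
    private final Deque<Integer> runs = new ArrayDeque<>();

    /** Records that n events were written to primary. */
    void putPrimary(int n) {
        append(n);
    }

    /** Records that n events were written to overflow. */
    void putOverflow(int n) {
        append(-n);
    }

    private void append(int signed) {
        if (signed == 0) {
            return;
        }
        Integer last = runs.peekLast();
        if (last != null && (last > 0) == (signed > 0)) {
            // Same store as the previous write: merge into one run.
            runs.pollLast();
            runs.addLast(last + signed);
        } else {
            runs.addLast(signed);
        }
    }

    /** True if the next take must be served from overflow. */
    boolean headIsOverflow() {
        Integer head = runs.peekFirst();
        return head != null && head < 0;
    }

    /**
     * Takes up to batchSize events, all from the store at the head of the
     * queue. The result can be smaller than batchSize even when the other
     * store still holds events.
     */
    int take(int batchSize) {
        Integer head = runs.pollFirst();
        if (head == null) {
            return 0;
        }
        int available = Math.abs(head);
        int taken = Math.min(batchSize, available);
        int left = available - taken;
        if (left > 0) {
            runs.addFirst(head > 0 ? left : -left);
        }
        return taken;
    }

    /** Total number of events tracked across both stores. */
    int totalEvents() {
        int sum = 0;
        for (int run : runs) {
            sum += Math.abs(run);
        }
        return sum;
    }
}
```

With 3 events in primary followed by 5 in overflow, a take with batch size 4 
returns only 3 events, because a single transaction draws from one store. 
That is the short-batch case raised above.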



 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, FLUME-1227.v2.patch, SpillableMemory 
 Channel Design 2.pdf, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-04-10 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628201#comment-13628201
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Thanks for your patience with this, Roshan.
This approach seems fine. It is a good idea to explicitly do the instantiation 
inside the SC. You can go ahead with that for now I guess.

But here is some food for thought - The fundamental difference between this 
channel and the File Channel is the way the transactions get written out. Have 
you considered inheriting the File Channel and then adding a 2nd data structure 
(your primary memory channel) and have the decision making happen in the 
transaction code? I am not sure how feasible it is or even how smart an idea it 
is, but it might be worth considering.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-04-10 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628260#comment-13628260
 ] 

Roshan Naik commented on FLUME-1227:


That's a very interesting suggestion. Thanks. I shall play with that idea too.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-04-08 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625573#comment-13625573
 ] 

Roshan Naik commented on FLUME-1227:


Hari, Juhani, if there are no additional concerns then I shall proceed with 
this approach. Settling on the general approach now will help us avoid 
pouring effort into an unacceptable direction. I shall wait another day 
before proceeding.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-04-08 Thread Juhani Connolly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626167#comment-13626167
 ] 

Juhani Connolly commented on FLUME-1227:


Seems like a reasonable compromise to me. I think any approach will have 
issues. Option 3 would probably be preferable to option 4 if it's doable.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-04-03 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621057#comment-13621057
 ] 

Mike Percy commented on FLUME-1227:
---

Roshan, that sounds good to me. Hari, Juhani, do you guys have any additional 
feedback on this proposal?

Thanks,
Mike

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-27 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614991#comment-13614991
 ] 

Roshan Naik commented on FLUME-1227:


I am not particularly wedded to the current approach. My first attempt was 
based on your suggestion to inline the config of the overflow channel in the 
SC itself. I discovered some [serious 
issues|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13540116&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13540116]
 with it, and so I pursued the alternative that had been discussed (but w/o 
consensus). The intent was to get the less contentious core logic working and 
return quickly to this phase of getting feedback on the shaky parts.

- Since you mention it, explicitly depending on FC (I assume by invoking 'new 
FileChannel()' inside SC) has not been discussed. It might be worth 
considering. 

- Forking FC / creating yet another durable channel: this has been talked 
about, and the concerns have been with duplication of code (perhaps the most 
complex piece of Flume code). I think Juhani also noted the same. I too am 
concerned about that. If forked, each FC bug would have to be fixed in 2 
places. FC seems to keep evolving, and the fork will likely become stale. I 
wonder if it makes sense to derive a class from FC and use it as overflow 
instead.

- Your unresolved code review question: we spoke about this when we met at the 
Flume meetup. On restart the overflow is drained completely first. It is 
addressed in the design doc under 'recovery from failures', but perhaps not 
very clearly.

- Yes, if SC does not have to guarantee strict ordering, then as long as the 
counts in the DOQ are correct, things will work fine. Ordering guarantees from 
overflow are needed only if SC is required to provide an ordering guarantee. 
We already have a consensus that SC will not rely on any non-explicit FC 
guarantees.

- I totally agree with Hari and yourself on the transactionCapacity issue. It 
makes total sense to expose channel size and capacity at the channel 
interface. I didn't do it in the first patch as I was afraid it might become a 
big point of contention. Perhaps a misplaced fear. MemC, FC and JdbcC may need 
minor tweaks for it. If there are no objections, I can go ahead and make this 
change.


I think the only remaining open issue now is how to deal with overflow. Let me 
list the options that have been put forward so far, and some more: 

1) User specifies in config which channel to use as overflow: the current 
approach, and it has given me all the grief that I anticipated :)
2) Fork FC / create yet another durable FC-like store, then embed it into SC. 
Some comments have been made on this already.
3) Explicitly instantiate FC directly inside SC. 
4) Derive another class from FC and embed it into SC.
5) Based on Mike's comment about SinkProcessors: does it make sense to 
experiment with the notion of ChannelProcessors? 
6) Any other ideas? Now would be THE time to speak.
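
As a rough illustration of what option 3 implies structurally, the sketch 
below has the spillable channel construct its overflow store itself and drive 
that store's lifecycle. Everything here is hypothetical: the Store interface 
and the RecordingStore stand-in are not Flume's own types, and a real version 
would instantiate and configure the actual file channel instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of option 3; none of these types are Flume's own. */
final class OwnedOverflowSketch {

    /** Minimal lifecycle contract assumed for this illustration. */
    interface Store {
        void start();
        void stop();
    }

    /** Stand-in for a durable store; it only records lifecycle calls. */
    static final class RecordingStore implements Store {
        final List<String> calls = new ArrayList<>();

        @Override
        public void start() {
            calls.add("start");
        }

        @Override
        public void stop() {
            calls.add("stop");
        }
    }

    /**
     * The spillable channel creates its overflow store internally, so no
     * other configured channel has to be looked up or shared.
     */
    static final class Spillable {
        private final RecordingStore overflow = new RecordingStore();

        void start() {
            overflow.start();
        }

        void stop() {
            overflow.stop();
        }

        List<String> overflowCalls() {
            return overflow.calls;
        }
    }
}
```

The point of the shape is that the overflow store is never visible in the 
agent config as a separate channel, so its start and stop are tied to the 
owning channel.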



 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-25 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613418#comment-13613418
 ] 

Mike Percy commented on FLUME-1227:
---

Roshan, thanks a lot for this design documentation.

Guys, based on my prior [reviewboard 
comment|https://reviews.apache.org/r/9544/] one big problem I have with this 
implementation is the way that the channels are allowed to know about each 
other. I am completely against this because it violates separation of 
responsibilities and encourages unmaintainable spaghetti dependencies between 
components. What's next, sinks? That is why we have SinkProcessors (so sinks 
don't have to know about each other). We simply cannot afford to open that 
Pandora's box. Let the SpillableChannel instantiate its own dependencies and 
govern their lifecycle.
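
To make that last suggestion concrete, here is a minimal Java sketch of a channel that creates its own overflow store and starts and stops it itself. The names SketchLifecycle, SketchOverflow and OwnedOverflowChannel are hypothetical stand-ins, not Flume's real lifecycle or channel interfaces, and the shared log list exists only to make the call order visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical start/stop contract; not Flume's real interface. */
interface SketchLifecycle {
    void start();
    void stop();
}

/** Hypothetical overflow store that records its lifecycle calls. */
class SketchOverflow implements SketchLifecycle {
    private final List<String> log;
    SketchOverflow(List<String> log) { this.log = log; }
    @Override public void start() { log.add("overflow.start"); }
    @Override public void stop() { log.add("overflow.stop"); }
}

/** Spillable channel that instantiates its own dependency and owns its lifecycle. */
class OwnedOverflowChannel implements SketchLifecycle {
    private final List<String> log;
    private final SketchOverflow overflow;

    OwnedOverflowChannel(List<String> log) {
        this.log = log;
        // The overflow is private: no other component can obtain a reference to it.
        this.overflow = new SketchOverflow(log);
    }

    @Override public void start() {
        overflow.start();            // dependency comes up first
        log.add("spillable.start");
    }

    @Override public void stop() {
        log.add("spillable.stop");
        overflow.stop();             // dependency goes down last
    }
}

class OwnedOverflowDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        OwnedOverflowChannel sc = new OwnedOverflowChannel(log);
        sc.start();
        sc.stop();
        System.out.println(log);
    }
}
```

With this shape the configuration system only ever sees one component, so nothing needs to be "introduced" to anything else.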

If explicitly depending on the file channel is a problem, then let's talk about 
ways to mitigate that... either forking a copy of the FC code into SC so that 
FC can evolve separately, or explicitly not relying on ordering in SC, if that 
is the issue. Therefore SC would not have ordering guarantees. Can the Drain 
Order Queue survive that situation? It makes me a little nervous that DOQ even 
exists to be honest... I don't really like it. It seems like a somewhat complex 
and brittle mechanism for achieving this spill functionality. But I would not 
block this patch because I'm not in love with the DOQ. And I think if the SC 
doesn't have to guarantee order then as long as its counts are correct then it 
should still work. Correct me if I'm wrong.

If specific non-explicit guarantees of the FC are being relied on then an 
alternative is to consider a different design that relies on different 
invariants than the DOQ does. I'm not necessarily advocating for that, I'm just 
throwing it out there as an option. But I'd be happy with forking the FC and 
getting this checked in without a total redesign to make progress if that 
addresses others' concerns.

My other as-yet unresolved item of code review feedback involved what happens 
when the agent is stopped then restarted while the channel has events in both 
the primary and secondary channels. Can this please be addressed as well?

Additionally, I agree with Hari on the use of transactionCapacity as a poor 
substitute for a reservation amount on the underlying channels. We need a 
better way, and if exposing channel size and capacity via an interface will 
help then I'm all for it.

Regards,
Mike


 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf


 I would like to introduce a new channel that would behave similarly to scribe 
 (https://github.com/facebook/scribe). It would be something between the memory 
 and file channels. Input events would be saved directly to memory (only) 
 and would be served from there. In case the memory is full, we 
 would offload the events to a file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just a 
 limited number of machines from where we would send the data to HDFS (some 
 sort of staging layer). The reason for this second layer is our need to decouple 
 event aggregation and front end code onto separate machines. Using the memory 
 channel is fully sufficient as we can survive the loss of some portion of the 
 events. However, in order to sustain maintenance windows or networking issues 
 we would have to end up with a lot of memory assigned to those staging 
 machines. The referenced scribe deals with this problem by implementing the 
 following logic: events are saved in memory, similarly to our MemoryChannel. 
 However, in case the memory gets full (because of maintenance, networking 
 issues, ...) it will spill data to disk, where it will sit until 
 everything starts working again.
 I would like to introduce a channel that would implement similar logic. Its 
 durability guarantees would be the same as MemoryChannel's - if someone 
 removed the power cord, this channel would lose data. Based on the 
 discussion in FLUME-1201, I would propose to have the implementation 
 completely independent of any other channel's internal code.
 Jarcec

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-23 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611899#comment-13611899
 ] 

Roshan Naik commented on FLUME-1227:


 
- I concur that unspecified guarantees should not be depended upon. I can drop 
that assumption from the tests.

- I think it's very important not to keep leaving the guarantees 
unspecified. But that's for another Jira.

- WRT deferring the decision to commit() time: let me revisit that issue. 
 

*Instantiation & config*:
For discussion, I would like to treat instantiation (new up the object) 
separately from life cycle (start/stop), since an existing instance may get reused 
during reconfiguration. 

Overflow does not need to be instantiated or configured before SC! Just like 
sources, sinks and channels, they can be instantiated and configured independently in 
any order. Only start/stop needs to be coordinated between the two. Also we need 
to ensure that SC is not able to get a reference to overflow if overflow had 
configuration errors.

 All components (sinks/sources/channels) get introduced to each other after 
they are correctly configured. There is already a step to introduce configured 
sinks and sources to their channels. I have extended that step to introduce 
channels to each other. The current implementation is a bit permissive and 
could be tightened up so that SC is limited to obtaining a handle only to its 
overflow (not other channels).

*Life cycle*:
Hari, correct me if you think it's not the case, but I think the current design 
is in tune with your desire that the SC owns the lifecycle (start/stop) of the 
overflow. The config subsystem merely instantiates, configures and introduces the 
two channels to each other. Thereafter it disowns the lifecycle of overflow and 
lets the SC manage overflow's lifecycle. It retains ownership of SC's lifecycle, 
however. This is nice because we don't have to replicate solutions to some of 
the config related aspects in SC. We do not have to worry about the order in 
which channels are instantiated and configured, and at the same time gain 
control over the order in which start/stop is called on the SC and its 
overflow.


*Scribe*:
 Juhani, I think the spilling policy can definitely be tweaked. Right now I spill 
into overflow only when primary is full. I like the idea that we can take a cue 
from the fact that takes() have begun to fail and start spilling early to 
minimize data loss. There is a throughput concern that I have with Scribe's 
operating mode, where it switches exclusively to using either memory or disk. In 
SC's design we do not need to wait for the overflow to completely drain before 
resuming the use of the faster primary. I'll look more into scribe and see what 
we can leverage.
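
As an illustration of how puts can return to the primary before the overflow has drained while takes still come out in arrival order, here is a small Java sketch. It is only one possible reading of the "drain order queue" idea discussed in this thread, not the code from the attached patch: the class name DrainOrderSketch, the signed run-length encoding, and the plain in-memory deques standing in for both channels are all assumptions, and transactions and thread safety are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class DrainOrderSketch {
    private final int primaryCapacity;
    private final Deque<String> primary = new ArrayDeque<>();
    private final Deque<String> overflow = new ArrayDeque<>();
    // Run lengths in arrival order: +n means n events in primary, -n means n in overflow.
    private final Deque<Integer> drainOrder = new ArrayDeque<>();

    DrainOrderSketch(int primaryCapacity) { this.primaryCapacity = primaryCapacity; }

    /** Stores the event in primary if there is room, otherwise spills it to overflow. */
    void put(String event) {
        boolean toPrimary = primary.size() < primaryCapacity;
        if (toPrimary) {
            primary.addLast(event);
        } else {
            overflow.addLast(event);
        }
        int step = toPrimary ? 1 : -1;
        Integer last = drainOrder.peekLast();
        if (last != null && Integer.signum(last) == step) {
            drainOrder.pollLast();
            drainOrder.addLast(last + step);   // extend the current run
        } else {
            drainOrder.addLast(step);          // start a new run
        }
    }

    /** Returns the oldest event, or null when both stores are empty. */
    String take() {
        Integer head = drainOrder.pollFirst();
        if (head == null) {
            return null;
        }
        boolean fromPrimary = head > 0;
        int remaining = fromPrimary ? head - 1 : head + 1;
        if (remaining != 0) {
            drainOrder.addFirst(remaining);    // run not finished yet
        }
        return fromPrimary ? primary.pollFirst() : overflow.pollFirst();
    }

    int overflowSize() { return overflow.size(); }
}
```

With a capacity of 2, putting a, b, c, d spills c and d. After one take frees a slot, a put of e lands in primary even though overflow still holds two events, and subsequent takes still return b, c, d, e in that order.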


- The fsync experiment is something I would like to defer while I resolve the other 
open items. It does not look like a blocker; it is more of a perf tuning thing. 
Does that sound reasonable?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-22 Thread Juhani Connolly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610111#comment-13610111
 ] 

Juhani Connolly commented on FLUME-1227:


I would personally prefer seeing a dependence on existing channels than another 
implementation of something like the file channel and something like the memory 
channel. The code-base is already getting pretty big, and the interfaces are 
fixed. The spillable channel shouldn't even know or care about what type the 
main/sub channels are; it should just feed them data. While it might not be the most 
optimal solution performance-wise, I think the cost would be small and it would 
give us less code to maintain overall. Either approach certainly has its merits.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-21 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608700#comment-13608700
 ] 

Roshan Naik commented on FLUME-1227:


Thanks Hari.

 1) WRT the concern on not depending on another channel: I went down this path 
since it looked like there was some consensus when I started. What alternative 
design do you have in mind?

 2) WRT a change in the memory/file channel breaking the Spillable channel: could you 
expand a bit? I am not familiar with the replay order issue and how it can have an impact. 
 I don't think any intrinsic assumption is being made about any specific 
channel's behavior. Just to be doubly sure, I made sure not to rely on a single 
type of overflow channel in all the tests. The only material dependency (as far 
as I can tell) that Spillable Channel has on the overflow is the interface 
level guarantee that is expected from all channels: that order is maintained in 
the case of a single source/sink. 
Do you see any other assumptions/dependencies hiding there?

 3) WRT reserving capacity on both channels: if you mean that each txn should 
not reserve capacity on both channels, I agree. And the current implementation 
does not do that. Or were you by any chance referring to the issue of upfront 
reservation (at put() time) versus commit() time?

 4) WRT testing with fsyncs removed: I have not pursued it since I felt that 
would be compromising the durability guarantees. Do you think it's useful to do 
that? 

 5) WRT "I don't think we should make the configuration change": can you elaborate? I am not 
certain which change specifically you are referring to. Or are you referring 
to the whole config approach?
 
 6) WRT lifecycle management and dependencies: after configuration, any 
channel that is found to be not connected with a source/sink is automatically 
discarded from the list of components managed by the life cycle system. Consequently 
the Spillable Channel becomes the sole life cycle manager of the overflow 
channel. Otherwise, yes, there would be havoc.





[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-21 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609360#comment-13609360
 ] 

Hari Shreedharan commented on FLUME-1227:
-

{quote}
1) WRT the concern on not depending on another channel: I went down this path 
since it looked like there was some consensus when I started. What alternative 
design do you have in mind?

2) WRT a change in the memory/file channel breaking the Spillable channel: could you 
expand a bit? I am not familiar with the replay order issue and how it can have an impact. 
I don't think any intrinsic assumption is being made about any specific 
channel's behavior. Just to be doubly sure, I made sure not to rely on a single 
type of overflow channel in all the tests. The only material dependency (as far 
as I can tell) that Spillable Channel has on the overflow is the interface 
level guarantee that is expected from all channels: that order is maintained in 
the case of a single source/sink. 
Do you see any other assumptions/dependencies hiding there?
{quote}

I am sorry, I was not part of the initial discussions - so I was not aware of 
the consensus aspect. What I am saying is that being dependent on another 
channel creates an undesired strong coupling between this channel and the other 
channels. And if there are unit tests in this channel which can break if one of 
the other channels' behavior is changed, then it is not something that is 
acceptable. If you look at all our other components, none of them have a 
dependence on each other (except the RPCClients - that is because the sinks are 
just glorified RPCClients). 

The reason I would not agree with even the single source/sink replay order is 
that our interfaces do not really enforce this. This is not really even 
enforced anywhere in the documentation either. The FileChannel did not even 
conform to that single source/sink replay order until FLUME-1432. In fact, 
conforming to that order even in FLUME-1432 was a side-effect of fixing a race 
condition, and not specifically because it was meant to be handled. At some 
point, if it is decided this can change again to some other order (maybe a 
thread based ordering, or an order in which events in a transaction will all 
get written out together on commit, rather than getting written out on put and 
fsynced on commit), then if this channel's tests break, the onus will be on the 
contributor who submitted the file channel change to fix it - which I do not 
agree with.

In summary, I am ok with depending on other channels. What I am not ok with is 
depending on the behavior of those channels, which are not explicitly 
guaranteed through interfaces (or even documentation).

bq. 3) WRT reserving capacity on both channels: if you mean that each txn 
should not reserve capacity on both channels, I agree. And the current 
implementation does not do that. Or were you by any chance referring to the 
issue of upfront reservation (at put() time) versus commit() time?

I am talking about put vs. commit time. In most cases, transaction capacity is 
configured to be much higher than the max actually expected. I 
would suggest doing a full implementation where there is a transaction outside, 
and a backing store inside. Once the transaction is about to get committed, 
then decide where the events go. (It is going to be tricky to do this and avoid 
doing all the writes at once - the File Channel fsyncs on commit, but writes to 
OS buffers on every write - so it is possible some data is flushed to disk 
before explicit fsyncs). This is not a blocker anyway, we can work on it later 
as well.
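
A rough Java sketch of the "decide at commit time" idea follows. It assumes a hypothetical CommitTimeRoutingSketch with two in-memory deques standing in for the primary and overflow stores; PutTxn is not Flume's Transaction interface. The point is only that the routing decision uses the number of events actually staged rather than the configured transactionCapacity. Ordering between the two stores, and the write-batching difficulty mentioned above, are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class CommitTimeRoutingSketch {
    private final int primaryCapacity;
    final Deque<String> primary = new ArrayDeque<>();
    final Deque<String> overflow = new ArrayDeque<>();

    CommitTimeRoutingSketch(int primaryCapacity) { this.primaryCapacity = primaryCapacity; }

    /** Hypothetical put-transaction: events are staged and routed only on commit. */
    class PutTxn {
        private final List<String> staged = new ArrayList<>();

        void put(String event) { staged.add(event); }

        /** Routes the whole batch by its real size; returns the name of the chosen store. */
        String commit() {
            boolean fits = primary.size() + staged.size() <= primaryCapacity;
            (fits ? primary : overflow).addAll(staged);
            staged.clear();
            return fits ? "primary" : "overflow";
        }

        /** Nothing was reserved or written anywhere, so rollback just drops the batch. */
        void rollback() { staged.clear(); }
    }

    PutTxn begin() { return new PutTxn(); }
}
```

Because nothing is reserved at put() time, a large configured transaction capacity no longer pushes small transactions into the overflow.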

bq. 4) WRT testing with fsyncs removed: I have not pursued it since I felt 
that would be compromising the durability guarantees. Do you think it's useful 
to do that?

I was wondering whether simply adding a config param to change the fsyncs 
(fsync all files before checkpoint in parallel or something) to optional will 
give comparable performance to what is being proposed in this jira. I have a 
feeling it might, since fsyncs are the most expensive part of the file channel, 
and removing the fsyncs just writes to the in-memory OS buffer and the fsyncs 
will be taken care of in the background. 
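
For readers unfamiliar with the distinction, the Java sketch below shows what an optional fsync could look like at the file level. It uses the JDK's java.nio.channels.FileChannel (unrelated to Flume's File Channel class despite the name); the class OptionalFsyncSketch and the fsyncOnCommit flag are hypothetical and do not correspond to an existing Flume configuration parameter.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OptionalFsyncSketch {
    /** Appends one record; forces it to the device only when fsyncOnCommit is true. */
    static void commit(Path log, String record, boolean fsyncOnCommit) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);       // data reaches the OS page cache here
            }
            if (fsyncOnCommit) {
                ch.force(false);     // flush file content (not metadata) to storage
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("spill-sketch", ".log");
        commit(log, "event-1", false);   // fast path: the OS flushes in the background
        commit(log, "event-2", true);    // durable path: waits for the device
        System.out.println(Files.readAllLines(log, StandardCharsets.UTF_8));
        Files.delete(log);
    }
}
```

Both records are readable afterwards either way; the flag only changes whether the first one would survive a power loss that happens right after commit() returns.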

{quote}
5) WRT "I don't think we should make the configuration change": can you elaborate? I am not 
certain which change specifically you are referring to. Or are you referring to 
the whole config approach?
6) WRT lifecycle management and dependencies: after configuration, any channel 
that is found to be not connected with a source/sink is automatically discarded 
from the list of components managed by the life cycle system. Consequently the 
Spillable Channel becomes the sole life cycle manager of the overflow channel. 
Otherwise, yes, there would be havoc.
{quote}

I just think we should not allow one component to pull a reference to another 
component in the system. This explicitly breaks the 

[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-21 Thread Juhani Connolly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609824#comment-13609824
 ] 

Juhani Connolly commented on FLUME-1227:


I had a look at the design doc and comments so just thought I'd chip in.

So long as we're only depending on the Channel interface for behaviors, I think 
we're good. I believe this was the intention in an earlier proposal of this 
feature.

I agree with Hari about ordering. It's not a guarantee we enforce in flume, and 
while nice, I think that it over-complicates things. 

As to lifecycle management, I don't necessarily feel that having a channel own 
its sub-channels is a particularly good precedent. I think it would be 
preferable that we allow the lifecycle manager to return interfaces rather than 
having components creating other components explicitly. Configuration would 
have to have some grasp of dependencies though... Sub-channels would need to 
be instantiated before the owner.

As to the fsync thing: definitely should be an option. Separate issue though. 
Making it possible to disable it would be great. Since this depends on 
in-memory data, durability really shouldn't be an issue. If you have data in 
memory, it doesn't really matter if it's in the memory channel or in the OS 
file buffer.

One thing you may want to consider is the approach taken by scribed (which has 
other problems, but the buffer store implementation is very nice):
- Default to using the main channel.
- Upon a next hop failure (rollback of a take transaction in our case), switch to 
a buffering mode. All data is sent to the buffer channel until recovery. One 
may want to move the contents of the primary channel to the buffer if 
maintaining ordering is an objective. This could also reduce loss of data.
- During buffering mode, puts and takes go to the buffer channel, until it has 
been drained. Once it has been drained, return to streaming mode where 
operations are performed against the primary channel.
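
The three steps above can be sketched as a small two-state machine in Java. This is an illustration under stated assumptions, not scribe's or Flume's code: BufferModeSketch, its Mode enum and onTakeRollback are hypothetical names, both stores are plain in-memory deques, and the optional step of moving the primary's contents into the buffer is included to keep ordering.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class BufferModeSketch {
    enum Mode { STREAMING, BUFFERING }

    private Mode mode = Mode.STREAMING;
    private final Deque<String> primary = new ArrayDeque<>();
    private final Deque<String> buffer = new ArrayDeque<>();

    Mode mode() { return mode; }

    /** Puts go to whichever store the current mode selects. */
    void put(String event) {
        (mode == Mode.STREAMING ? primary : buffer).addLast(event);
    }

    /** Returns the next event, or null if the active store is empty. */
    String take() {
        if (mode == Mode.BUFFERING && buffer.isEmpty()) {
            mode = Mode.STREAMING;             // buffer drained: resume streaming
        }
        return (mode == Mode.STREAMING ? primary : buffer).pollFirst();
    }

    /** Called with the events of a take transaction that was rolled back. */
    void onTakeRollback(List<String> taken) {
        if (mode == Mode.STREAMING) {
            buffer.addAll(primary);            // keep ordering across the switch
            primary.clear();
            mode = Mode.BUFFERING;
        }
        for (int i = taken.size() - 1; i >= 0; i--) {
            buffer.addFirst(taken.get(i));     // rolled-back events return to the head
        }
    }
}
```

Unlike the spill-when-full policy, this design never reads from both stores at once, which is the throughput trade-off raised elsewhere in this thread.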



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-21 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609911#comment-13609911
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Hi Juhani,

Thanks for your comments. I agree with most of what you have mentioned.
{quote}
As to lifecycle management, I don't necessarily feel that having a channel own 
its sub-channels is a particularly good precedent. I think it would be 
preferable that we allow the lifecycle manager to return interfaces rather than 
having components creating other components explicitly. Configuration would 
have to have some grasp of dependencies though... Sub-channels would need to be 
instantiated before the owner.
{quote}

I agree with your last statement. Configuration will also need to detect cycles 
so that you don't end up with a loop of interdependent components. I don't 
particularly like the idea of passing references to existing channels to other 
components to use as sub-channels, but I won't block it since 
there seems to have been some consensus regarding this earlier. I frankly think 
2 channels within the same one is overkill. I think this channel can be easily 
implemented by using a mmap-ed file which is never specifically fsync-ed. This 
might cause some page faults etc., but the page cache management is usually 
smart enough to not cause this to affect performance a whole lot - this 
implementation is likely to be faster too (in fact, this is very similar to the 
File Channel checkpoint class). Using this as a cyclic buffer would probably be 
as good, and gives the same guarantees as the memory channel (which is what we 
are targeting in this jira, I suppose?). 
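
On the cycle detection point, a depth-first search over the "uses" relation is enough. The Java sketch below assumes the configuration has already been reduced to a map from component name to the names it depends on; DependencyCycleSketch and that map shape are hypothetical, not part of Flume's configuration code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencyCycleSketch {
    /** Returns true if some component reaches itself by following "uses" edges. */
    static boolean hasCycle(Map<String, List<String>> uses) {
        Set<String> done = new HashSet<>();     // fully explored, known cycle-free
        Set<String> onPath = new HashSet<>();   // components on the current DFS path
        for (String component : uses.keySet()) {
            if (visit(component, uses, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> uses,
                                 Set<String> done, Set<String> onPath) {
        if (done.contains(node)) {
            return false;
        }
        if (!onPath.add(node)) {
            return true;                        // node is already on the path: cycle
        }
        for (String dep : uses.getOrDefault(node, Collections.<String>emptyList())) {
            if (visit(dep, uses, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> uses = new HashMap<>();
        uses.put("spillable", Collections.singletonList("overflow"));
        uses.put("overflow", Collections.<String>emptyList());
        System.out.println(hasCycle(uses));
    }
}
```

A configuration loader could run such a check before starting anything, and the same traversal order (dependencies finished first) gives a valid instantiation order for sub-channels.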

Also, I like the implementation you have mentioned above, though this can be 
quite tricky to get right. 



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-19 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606137#comment-13606137
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Roshan, 

Sorry it took me this long to get to this one. I reviewed the design document 
and I have a couple of relatively major concerns:

#. This channel implicitly depends on the behavior of current channels - the 
File Channel and Memory Channel. As one of the people who maintain the file 
channel, I strongly feel this is not the correct thing to do. It is possible 
that behavior of the File Channel or the Memory Channel could change (This is 
not without precedent. In FLUME-1437, we did change the replay order). At that 
point, a change in the behavior of the File Channel or Memory Channel would 
break unit/integration tests for this channel - which could delay a commit. 

#. I don't think we should make the configuration change. The idea of the 
Lifecycle manager is to handle all the components and make them independent of 
each other. Dependencies on other components managed by the Lifecycle system are 
a bad idea. This also sets a bad precedent. This can lead to patches that make 
components interdependent and depend on the other component being a particular 
one (for example, a source using this hook to figure out if it is operating on 
Memory Channel or File Channel). 

I believe the current design is a bit more complex than it needs to be - due to 
the handling of more than one transaction. Also reserving transaction capacity 
on both channels is a bad indicator of where the txn should go. In my 
experience, people do set the transaction capacity to a value much higher than 
the average transaction. 

Also, have you tested this against a slightly modified File Channel with all of 
the fsyncs removed (or commented out)? I'd be interested in seeing the 
difference in performance at that point. Also, see FLUME-1423 where Denny 
removed the fsyncs for performance (the performance of the channel has improved 
even more since then though).

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf


 I would like to introduce a new channel that would behave similarly to scribe 
 (https://github.com/facebook/scribe). It would be something between the memory 
 and file channels. Input events would be saved directly to memory (only) 
 and would be served from there. If the memory became full, we 
 would spill the events to a file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just a 
 limited number of machines from where we would send the data to HDFS (some 
 sort of staging layer). The reason for this second layer is our need to decouple 
 event aggregation and front-end code onto separate machines. Using the memory 
 channel is fully sufficient as we can survive the loss of some portion of the 
 events. However, in order to sustain maintenance windows or networking issues 
 we would end up with a lot of memory assigned to those staging 
 machines. The referenced scribe deals with this problem by implementing the 
 following logic - events are saved in memory, similarly to our MemoryChannel. 
 However, if the memory gets full (because of maintenance, networking 
 issues, ...) it will spill data to disk, where they will sit until 
 everything starts working again.
 I would like to introduce a channel that would implement similar logic. Its 
 durability guarantees would be the same as MemoryChannel's - if someone 
 removed the power cord, this channel would lose data. Based on the 
 discussion in FLUME-1201, I would propose to have the implementation 
 completely independent of any other channel's internal code.
 Jarcec

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-03-14 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603141#comment-13603141
 ] 

Roshan Naik commented on FLUME-1227:


Looking to revive attention on this one.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1, SpillableMemory Channel Design.pdf




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-02-27 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588425#comment-13588425
 ] 

Brock Noland commented on FLUME-1227:
-

Same as Mike. [~hshreedharan] any time for a review? 

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2013-02-27 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588624#comment-13588624
 ] 

Hari Shreedharan commented on FLUME-1227:
-

I can take a quick look later today, though I can't promise when I can do a 
full review.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik
 Attachments: 1227.patch.1




[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-12-27 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540116#comment-13540116
 ] 

Roshan Naik commented on FLUME-1227:


Seeking input ..

The current configuration system does not look conducive to chaining channels. 
Here are the configuration techniques that have been discussed previously:

1) Out-of-line:

agent1.channels = channel1 channel2

agent1.channels.channel1.type = SPILLABLE
agent1.channels.channel1.overflow  = channel2

agent1.channels.channel2.type = FILE
agent1.channels.channel2.checkpointDir = /path1
...


The problems here are:
- At the time channel1 is configured, channel2 may not have been instantiated 
yet, so it is not possible to latch on to an instance of channel2. It may be 
better to defer obtaining a reference to the overflow channel until start time.
- There is no mechanism to get a reference to one channel from another (in this 
case, at start time).
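
The deferred lookup described in the first bullet can be sketched as follows. This is a hedged illustration in Python for brevity, not Flume code: the `registry` dictionary, the class names and the `load_channels` helper are all hypothetical stand-ins for whatever mechanism would let one channel find another.

```python
class FileChannel:
    def __init__(self, name):
        self.name = name
        self.checkpoint_dir = None

    def configure(self, props, registry):
        self.checkpoint_dir = props.get("checkpointDir")

    def start(self):
        pass


class SpillableChannel:
    def __init__(self, name):
        self.name = name
        self.overflow_name = None
        self.overflow = None
        self._registry = None

    def configure(self, props, registry):
        # Only remember the name: the overflow channel may not exist yet.
        self.overflow_name = props["overflow"]
        self._registry = registry

    def start(self):
        # By start time every channel has been instantiated and configured.
        self.overflow = self._registry[self.overflow_name]


# Hypothetical mapping from the configured type to a class.
CHANNEL_TYPES = {"SPILLABLE": SpillableChannel, "FILE": FileChannel}


def load_channels(config):
    registry = {}
    # Phase 1: instantiate and configure in declaration order.
    for name, props in config.items():
        channel = CHANNEL_TYPES[props["type"]](name)
        registry[name] = channel
        channel.configure(props, registry)
    # Phase 2: start, at which point cross-references can be resolved.
    for channel in registry.values():
        channel.start()
    return registry


config = {
    "channel1": {"type": "SPILLABLE", "overflow": "channel2"},
    "channel2": {"type": "FILE", "checkpointDir": "/path1"},
}
channels = load_channels(config)
print(channels["channel1"].overflow.name)
```

The sketch also makes the second bullet visible: it only works because `configure()` is handed a shared registry, which is exactly the mechanism that does not exist today.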




2) Inline: (as suggested by Mike)

agent1.channels = channel1 

agent1.channels.channel1.type = SPILLABLE
agent1.channels.channel1.overflowChannel.type = FILE
agent1.channels.channel1.overflowChannel.checkpointDir = /path1
agent1.channels.channel1.overflowChannel.dataDirs = /path2
... 

The issue here is that the instantiation and configuration of the overflow 
channel would now have to reside inside SpillableChannel::configure(). This 
method is not a very suitable place for doing such things.
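
For illustration, the inline form implies that configure() would first have to carve the overflowChannel.* keys out of its own properties. A minimal Python sketch of that step, where the helper name `sub_properties` is hypothetical and the property values are the ones from the example above:

```python
def sub_properties(props, prefix):
    """Return the entries whose key starts with `prefix`, prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in props.items()
        if key.startswith(prefix)
    }


# Properties of channel1 as they would appear after the
# "agent1.channels.channel1." part has been removed.
channel1_props = {
    "type": "SPILLABLE",
    "overflowChannel.type": "FILE",
    "overflowChannel.checkpointDir": "/path1",
    "overflowChannel.dataDirs": "/path2",
}

overflow_props = sub_properties(channel1_props, "overflowChannel.")
print(overflow_props)
```

Extracting the nested properties is the easy part; the concern is that configure() would then also have to instantiate and configure the nested channel itself.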


3) Hard coding:
Basically, hard code the file channel to be the overflow channel. This allows 
the file channel to be easily instantiated and configured. The downside is that 
it still duplicates the channel instantiation/config logic from 
AbstractConfigurationProvider.loadChannels().

Any thoughts ? 


 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Roshan Naik



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-12-05 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510888#comment-13510888
 ] 

Mike Percy commented on FLUME-1227:
---

Hey Roshan, sounds good to me except I'd recommend trying this out with a brand 
new channel that delegates to a memory channel, in order to minimize the risk 
of destabilizing what is a very solid and important core component.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-12-05 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510988#comment-13510988
 ] 

Roshan Naik commented on FLUME-1227:


You mean we conceptually create a new MemChannel++, where the ++ part is 
basically the overflow ability?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-12-05 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511024#comment-13511024
 ] 

Mike Percy commented on FLUME-1227:
---

Right. Or we could call it SpillableChannel I guess. :) I don't have a strong 
opinion on the name, personally.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-29 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506841#comment-13506841
 ] 

Roshan Naik commented on FLUME-1227:


Hi Mike, yes, you are right. I think it is a downside of that algorithm. I 
realized the same after posting that comment.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-27 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504893#comment-13504893
 ] 

Roshan Naik commented on FLUME-1227:


Thanks for those valuable thoughts Mike.

I have described an algorithm for puts/takes 
[here|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13493481&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13493481].
It should solve the ordering problem, handle transactions correctly and 
maximize throughput.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-27 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505021#comment-13505021
 ] 

Brock Noland commented on FLUME-1227:
-

If we move forward with this proposal, I think it'd be great to see a design 
document.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-26 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504291#comment-13504291
 ] 

Roshan Naik commented on FLUME-1227:


Continuing the discussion...

I spent some time studying the discussions in the jiras related to solving the 
problem of spilling over (and/or failover). I think failover and spillover 
should not be conflated into the same problem, even though it may be 
possible to address them both in the same solution.

There is a consensus that the problem is worth addressing. The concerns 
revolve around these dimensions:

1) Complexity of implementation and configuration, and potentially 
[enhancements|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13430529&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13430529]
 to existing interfaces
2) Complexity of testing
3) Ensuring transaction guarantees are preserved, and determining how strong or weak they are
4) Defining the durability level (durable or not) of the final solution; this 
is simple IMHO
5) Efficiency of the solution (batching requests when spilling over)
6) Flexibility

The solutions discussed so far, along with their concerns:

 1) FailOver Sink processor - has issues with retaining transaction 
guarantees 
([Reference|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13235705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13235705])

 2) Mechanisms for composing existing channels 
([1201|https://issues.apache.org/jira/browse/FLUME-1201] and [my 
proposal|https://issues.apache.org/jira/browse/FLUME-1227?focusedCommentId=13492828&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492828])
 - flexible, but has complexities with regard to testing ([mixed opinions 
here|https://issues.apache.org/jira/browse/FLUME-1201?focusedCommentId=13282018&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13282018]),
 implementation, and determining durability 
([see|https://issues.apache.org/jira/browse/FLUME-1045?focusedCommentId=13235705&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13235705])

 3) Spillable Channel - limited functionality, but easier to test and to 
determine transaction+durability semantics for. 



My thoughts...
  The concerns about mechanisms for composing channels are largely centered 
on complexity. I feel some of them are unfounded.
 
  Testing a composition mechanism is not as complex as has been feared, for 
the reasons stated 
[here|https://issues.apache.org/jira/browse/FLUME-1201?focusedCommentId=13282018&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13282018].

 In a pluggable system (like the rest of Flume) we rely on guarantees from the 
interface itself. There is no need to test every combination of all possible 
channels, just as it does not make sense to test all combinations 
of sink/channel/source/interceptors/sink-processors in Flume.

Implementation of a composition mechanism would also be simpler. It would be 
focused only on the issues involved in stitching channels together, not on 
actually providing a robust backing store. 

Spillover channel (Mem + File) seems a little too specialized .. for instance, 
it does not provide durability for users who need it. It would be nice to allow 
the primary channel to be on a fast, smaller durable store (like SSDs) and 
overflow into another, slower durable store (like hard disk / JDBC).

The following general strategy for compounding channels seems worth discussing:

agent1.channels.compoundChannel.type = compound
agent1.channels.compoundChannel.1 = memChannel1
agent1.channels.compoundChannel.2 = fileChannel1
agent1.channels.compoundChannel.3 = jdbcChannel1

# batch size when spilling into fileChannel1
agent1.channels.compoundChannel.1.overflowBatchSize = 100
# batch size when spilling into jdbcChannel1
agent1.channels.compoundChannel.2.overflowBatchSize = 1000
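
To make the shape of that configuration concrete, here is a rough Java sketch of how the numbered entries could be read into an ordered chain. It is only an illustration: the `compound` type and the `overflowBatchSize` key are the proposal above and not an existing Flume feature, and `CompoundChannelConfig`, `Tier`, and the default batch size of 0 for a level with no entry are names and choices invented for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper: reads the proposed compound-channel keys into an ordered chain. */
final class CompoundChannelConfig {

    /** One level of the chain, in spill order. */
    static final class Tier {
        final String channelName;
        final int overflowBatchSize; // 0 when the level has no overflowBatchSize entry (assumption)

        Tier(String channelName, int overflowBatchSize) {
            this.channelName = channelName;
            this.overflowBatchSize = overflowBatchSize;
        }
    }

    private CompoundChannelConfig() {}

    /** Parses keys of the form prefix.N and prefix.N.overflowBatchSize, for N = 1, 2, ... */
    static List<Tier> parse(String propertiesText, String prefix) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Tier> tiers = new ArrayList<>();
        // Walk the numbered entries until the first missing index.
        for (int i = 1; props.containsKey(prefix + "." + i); i++) {
            String name = props.getProperty(prefix + "." + i).trim();
            String batch = props.getProperty(prefix + "." + i + ".overflowBatchSize", "0").trim();
            tiers.add(new Tier(name, Integer.parseInt(batch)));
        }
        return tiers;
    }
}
```

With the configuration above, `parse(text, "agent1.channels.compoundChannel")` would yield memChannel1, fileChannel1 and jdbcChannel1 in that order, each carrying the batch size to use when spilling into the next level.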



 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho

 I would like to introduce new channel that would behave similarly as scribe 
 (https://github.com/facebook/scribe). It would be something between memory 
 and file channel. Input events would be saved directly to the memory (only) 
 and would be served from there. In case that the memory would be full, we 
 would outsource the events to file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just 
 limited number of machines from where we would send the 

[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-12 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495737#comment-13495737
 ] 

Roshan Naik commented on FLUME-1227:


Looks like this JIRA is up for grabs? 
If there is agreement that my proposal is a good way forward, I would like to 
pick it up.
Thoughts?

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho

 I would like to introduce new channel that would behave similarly as scribe 
 (https://github.com/facebook/scribe). It would be something between memory 
 and file channel. Input events would be saved directly to the memory (only) 
 and would be served from there. In case that the memory would be full, we 
 would outsource the events to file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just 
 limited number of machines from where we would send the data to HDFS (some 
 sort of staging layer). Reason for this second layer is our need to decouple 
 event aggregation and front end code to separate machines. Using memory 
 channel is fully sufficient as we can survive lost of some portion of the 
 events. However in order to sustain maintenance windows or networking issues 
 we would have to end up with a lot of memory assigned to those staging 
 machines. Referenced scribe is dealing with this problem by implementing 
 following logic - events are saved in memory similarly as our MemoryChannel. 
 However in case that the memory gets full (because of maintenance, networking 
 issues, ...) it will spill data to disk where they will be sitting until 
 everything start working again.
 I would like to introduce channel that would implement similar logic. It's 
 durability guarantees would be same as MemoryChannel - in case that someone 
 would remove power cord, this channel would lose data. Based on the 
 discussion in FLUME-1201, I would propose to have the implementation 
 completely independent on any other channel internal code.
 Jarcec

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-12 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495778#comment-13495778
 ] 

Roshan Naik commented on FLUME-1227:


Actually, I think this proposal, if acceptable, would have to be a different 
JIRA, since the current JIRA is about introducing a new channel.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-12 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495785#comment-13495785
 ] 

Hari Shreedharan commented on FLUME-1227:
-

Roshan - that might be a good thing to do - but there was a discussion about a 
compound channel several months ago, and I believe the consensus was that it 
would be too complex to write and even more complex to test. But feel free to 
file a jira - I am sure there will be a healthy discussion.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-12 Thread Bernardo de Seabra (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495798#comment-13495798
 ] 

Bernardo de Seabra commented on FLUME-1227:
---

I like this approach (quite popular with Scribe), but my only concern is around 
performance. You would get an unpredictable impact on disk I/O, which could 
affect (in our case it would affect) your application if Flume and the app 
share the same disk. It's a tradeoff.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-08 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493481#comment-13493481
 ] 

Roshan Naik commented on FLUME-1227:


I agree Scribe's policy is suboptimal.  It is better to prioritize the parent 
channel whenever it has spare capacity and still maintain order. To achieve 
this I have a simple algorithm in mind...

The parent channel maintains a 'drain order' queue of signed numbers which 
indicates, at any time, the order in which the items in it and in its overflow 
channel should be drained.  For instance, the numbers [3,-2,6,-1] in that 
queue indicate the following drain order: 

- drain 3 from self
- then drain 2 from overflow
- then 6 from self
- then 1 from overflow


The channel's put() will update its drain order queue (DOQ) as follows:

  if (I have capacity) {
     + add event to my own queue
     + if last element in DOQ is +ve then increment it
     + else push +1 to DOQ
  } else {
     + call put() on overflow
     + if last element in DOQ is -ve then decrement it
     + else push -1 to DOQ
  }
 
I think the take() should be obvious.

Obviously, corner cases like empty self and empty overflow need to be handled 
appropriately.. but this is just capturing the idea.
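
For illustration, the drain order queue can be sketched in plain Java as below. This is a minimal, single-threaded sketch under assumptions: `SpillingQueue` and its method names are made up here, two in-memory deques stand in for the parent channel and its overflow channel, transactions are ignored, and take() is my reading of the "obvious" counterpart (consume one unit from the head of the DOQ and take from whichever side its sign points to).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: a bounded primary queue that spills into an overflow queue. */
final class SpillingQueue<E> {
    private final int primaryCapacity;
    private final Deque<E> primary = new ArrayDeque<>();
    private final Deque<E> overflow = new ArrayDeque<>(); // stands in for the overflow channel
    // Drain order queue: +n means "take n from primary", -n means "take n from overflow".
    private final Deque<Integer> doq = new ArrayDeque<>();

    SpillingQueue(int primaryCapacity) {
        this.primaryCapacity = primaryCapacity;
    }

    void put(E event) {
        boolean toPrimary = primary.size() < primaryCapacity;
        if (toPrimary) {
            primary.addLast(event);
        } else {
            overflow.addLast(event);
        }
        int step = toPrimary ? 1 : -1;
        Integer last = doq.peekLast();
        if (last != null && Integer.signum(last) == step) {
            doq.pollLast();
            doq.addLast(last + step); // extend the current run
        } else {
            doq.addLast(step);        // start a new run
        }
    }

    /** Returns the oldest event, or null when both queues are empty. */
    E take() {
        Integer head = doq.pollFirst();
        if (head == null) {
            return null;
        }
        int step = Integer.signum(head);
        int remaining = head - step;   // one fewer event left in this run
        if (remaining != 0) {
            doq.addFirst(remaining);
        }
        return step > 0 ? primary.pollFirst() : overflow.pollFirst();
    }

    /** The DOQ contents, head first, e.g. "[3, -2]". */
    String drainOrder() {
        return doq.toString();
    }
}
```

With a primary capacity of 3, putting five events leaves the DOQ at [3, -2]. One take() followed by two more puts gives [2, -2, 1, -1], and draining still returns the events in arrival order, which is the property the DOQ is meant to preserve.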





[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-08 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493500#comment-13493500
 ] 

Roshan Naik commented on FLUME-1227:


Apologies for the email storm created by multiple edits to my previous comment.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-07 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492828#comment-13492828
 ] 

Roshan Naik commented on FLUME-1227:


I don't see this option discussed, but it seems interesting (and IMO avoids some 
of the issues in sink-triggered spooling as discussed in FLUME-1045).

Basically, instead of adding another Spillable channel, which is logically a 
composite of mem & file channels, we could add a config directive to Memory 
Channel such as:

agent1.channels.memChannel1.overflow = fileChannel1

There would be a preconfigured file channel (or JDBC or some custom channel) 
into which the memory channel would simply spill events when its capacity has 
been reached. There should be no other sources or sinks tied to an overflow 
channel. 

Ideally any channel should be able to use another channel for overflow. 

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Patrick Wendell

 I would like to introduce new channel that would behave similarly as scribe 
 (https://github.com/facebook/scribe). It would be something between memory 
 and file channel. Input events would be saved directly to the memory (only) 
 and would be served from there. In case that the memory would be full, we 
 would outsource the events to file.
 Let me describe the use case behind this request. We have plenty of frontend 
 servers that are generating events. We want to send all events to just 
 limited number of machines from where we would send the data to HDFS (some 
 sort of staging layer). Reason for this second layer is our need to decouple 
 event aggregation and front end code to separate machines. Using memory 
 channel is fully sufficient as we can survive lost of some portion of the 
 events. However in order to sustain maintenance windows or networking issues 
 we would have to end up with a lot of memory assigned to those staging 
 machines. Referenced scribe is dealing with this problem by implementing 
 following logic - events are saved in memory similarly as our MemoryChannel. 
 However in case that the memory gets full (because of maintenance, networking 
 issues, ...) it will spill data to disk where they will be sitting until 
 everything start working again.
 I would like to introduce channel that would implement similar logic. It's 
 durability guarantees would be same as MemoryChannel - in case that someone 
 would remove power cord, this channel would lose data. Based on the 
 discussion in FLUME-1201, I would propose to have the implementation 
 completely independent on any other channel internal code.
 Jarcec



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-07 Thread Juhani Connolly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492876#comment-13492876
 ] 

Juhani Connolly commented on FLUME-1227:


Interesting suggestion... When would you suggest that the overflow channel's 
contents be read, and by what component?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-07 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492923#comment-13492923
 ] 

Roshan Naik commented on FLUME-1227:


The parent channel's put()/take() will be the source/sink for its overflow 
channel.  

For the special case of just supporting it in memory channel, I think it could 
easily employ whatever policy the SpillableChannel would have used. 

For the more general case of making this a cross-cutting feature available to 
all channels with the ability to chain, I would conjecture that it may be 
possible to use the same policy at each level of the chain.  So this policy 
could be pushed into the common base class for channels.



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-05 Thread Rahul Ravindran (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491154#comment-13491154
 ] 

Rahul Ravindran commented on FLUME-1227:


Is there a timeline on when this new channel would be out?



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-11-05 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491209#comment-13491209
 ] 

Mike Percy commented on FLUME-1227:
---

I don't know of anyone actively working on this...



[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-08-09 Thread Juhani Connolly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431609#comment-13431609
 ] 

Juhani Connolly commented on FLUME-1227:


Since the channel is not aware of the state of the sinks, I think Jarek's 
proposed method sounds good.

In another place, it was pointed out that we cannot just change the interface, 
as it would break people's custom components.

However, I think you can get away with an approach similar to how Configurable 
is handled now. Add a CapacityPollable interface or something, check whether 
the channel implements it, and poll if it does. If it does not, you will just 
have to rely on catching exceptions as an indicator of problems.
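
A minimal sketch of that optional-interface idea follows. Only the name
CapacityPollable comes from the comment above; SketchChannel, BoundedChannel,
PollableChannel, ChannelWriter and the remainingCapacity() method are
hypothetical stand-ins, not Flume's actual Channel API, and
IllegalStateException stands in for whatever exception a full channel throws.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Stand-in for an existing channel interface that must stay unchanged (hypothetical). */
interface SketchChannel {
    void put(String event);
}

/** Optional capability; the method name is an assumption for illustration. */
interface CapacityPollable {
    int remainingCapacity();
}

/** A channel written against the old interface only. */
class BoundedChannel implements SketchChannel {
    protected final Queue<String> queue = new ArrayDeque<>();
    protected final int capacity;

    BoundedChannel(int capacity) {
        this.capacity = capacity;
    }

    @Override
    public void put(String event) {
        if (queue.size() >= capacity) {
            throw new IllegalStateException("channel full");
        }
        queue.add(event);
    }
}

/** A channel that additionally opts in to capacity polling. */
class PollableChannel extends BoundedChannel implements CapacityPollable {
    PollableChannel(int capacity) {
        super(capacity);
    }

    @Override
    public int remainingCapacity() {
        return capacity - queue.size();
    }
}

class ChannelWriter {
    /** Returns true if the event was accepted by the channel. */
    static boolean tryPut(SketchChannel channel, String event) {
        // Poll only when the channel opts in; existing custom channels keep working.
        if (channel instanceof CapacityPollable
                && ((CapacityPollable) channel).remainingCapacity() == 0) {
            return false;
        }
        try {
            channel.put(event);
            return true;
        } catch (IllegalStateException full) {
            // Fallback for channels without the optional interface.
            return false;
        }
    }
}
```

The caller never needs to know which kind of channel it has: tryPut behaves the
same for both, and the exception path is only reached for channels that do not
implement the optional interface.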





[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-08-07 Thread Seetharam Venkatesh (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430470#comment-13430470
 ] 

Seetharam Venkatesh commented on FLUME-1227:


Does this mean there is no effort going into FLUME-1045?





[jira] [Commented] (FLUME-1227) Introduce some sort of SpillableChannel

2012-08-03 Thread Denny Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428531#comment-13428531
 ] 

Denny Ye commented on FLUME-1227:
-

That's great, and useful when Flume cannot reach HDFS or another destination. 
It is the same concept as in Scribe, which calls these the 'primary store' and 
'secondary store'. Looking forward to an implementation.

 Introduce some sort of SpillableChannel
 ---

 Key: FLUME-1227
 URL: https://issues.apache.org/jira/browse/FLUME-1227
 Project: Flume
  Issue Type: New Feature
  Components: Channel
Reporter: Jarek Jarcec Cecho
Assignee: Jarek Jarcec Cecho
