[jira] [Commented] (FLUME-1485) File Channel should support checksum

2012-08-14 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434811#comment-13434811
 ] 

Hari Shreedharan commented on FLUME-1485:
-

I'm OK with a file-level checksum. A transaction-level checksum might need 
considerable changes to the file format, etc.

> File Channel should support checksum
> 
>
> Key: FLUME-1485
> URL: https://issues.apache.org/jira/browse/FLUME-1485
> Project: Flume
>  Issue Type: New Feature
>  Components: Channel
>Reporter: Hari Shreedharan
> Fix For: v1.3.0
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (FLUME-1485) File Channel should support checksum

2012-08-14 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated FLUME-1485:


Issue Type: New Feature  (was: Bug)

Ideally, this will be implemented in such a way that the channel format does 
not change, so the checksum will either need to go at the end of the file or in 
a separate file. The new code can then check for the (optional) checksum and use 
it; if no valid checksum is found, alert about possible channel corruption.
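For illustration, a file-level checksum along these lines could look like the 
following stand-alone sketch (hypothetical code, not the actual File Channel 
format): the CRC32 of the log file's contents is appended as an 8-byte trailer, 
and verification recomputes it over everything before the trailer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

public class FileChecksum {
  // Append a CRC32 of the file's current contents as an 8-byte trailer.
  public static void seal(Path file) throws IOException {
    byte[] data = Files.readAllBytes(file);
    CRC32 crc = new CRC32();
    crc.update(data);
    byte[] trailer = ByteBuffer.allocate(8).putLong(crc.getValue()).array();
    Files.write(file, trailer, StandardOpenOption.APPEND);
  }

  // Recompute the CRC32 over everything except the trailer and compare.
  public static boolean verify(Path file) throws IOException {
    byte[] all = Files.readAllBytes(file);
    if (all.length < 8) {
      return false;                              // no checksum present
    }
    CRC32 crc = new CRC32();
    crc.update(all, 0, all.length - 8);
    long stored = ByteBuffer.wrap(all, all.length - 8, 8).getLong();
    return stored == crc.getValue();
  }

  public static void main(String[] args) throws IOException {
    Path f = Files.createTempFile("channel-log", ".dat");
    Files.write(f, "event-1\nevent-2\n".getBytes());
    seal(f);
    System.out.println(verify(f));               // true: file intact
    Files.write(f, "x".getBytes(), StandardOpenOption.APPEND);
    System.out.println(verify(f));               // false: trailer no longer matches
  }
}
```

Since the trailer is purely additive, old log files without it remain readable; 
only files written by the new code carry the extra 8 bytes.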



> File Channel should support checksum
> 
>
> Key: FLUME-1485
> URL: https://issues.apache.org/jira/browse/FLUME-1485
> Project: Flume
>  Issue Type: New Feature
>  Components: Channel
>Reporter: Hari Shreedharan
> Fix For: v1.3.0
>
>






[jira] [Commented] (FLUME-1485) File Channel should support checksum

2012-08-14 Thread Denny Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434809#comment-13434809
 ] 

Denny Ye commented on FLUME-1485:
-

Which level: each transaction, or each individual file?

> File Channel should support checksum
> 
>
> Key: FLUME-1485
> URL: https://issues.apache.org/jira/browse/FLUME-1485
> Project: Flume
>  Issue Type: Bug
>  Components: Channel
>Reporter: Hari Shreedharan
> Fix For: v1.3.0
>
>






[jira] [Updated] (FLUME-1485) File Channel should support checksum

2012-08-14 Thread Hari Shreedharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated FLUME-1485:


  Component/s: Channel
Fix Version/s: v1.3.0

> File Channel should support checksum
> 
>
> Key: FLUME-1485
> URL: https://issues.apache.org/jira/browse/FLUME-1485
> Project: Flume
>  Issue Type: Bug
>  Components: Channel
>Reporter: Hari Shreedharan
> Fix For: v1.3.0
>
>






[jira] [Created] (FLUME-1485) File Channel should support checksum

2012-08-14 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created FLUME-1485:
---

 Summary: File Channel should support checksum
 Key: FLUME-1485
 URL: https://issues.apache.org/jira/browse/FLUME-1485
 Project: Flume
  Issue Type: Bug
Reporter: Hari Shreedharan








Re: Review Request: Flume adopt message from existing local Scribe

2012-08-14 Thread Denny Ye


> On Aug. 15, 2012, 4:01 a.m., Juhani Connolly wrote:
> > Ok, I had a look over, and tested the new code and it seems fine. Other 
> > additions look good too
> > 
> > It would have been nice to have the documentation as part of the review 
> > too, but this issue has been in review long enough, so I might open another 
> > ticket later to review the documentation and add more information on usage.

Thanks Juhani. I also think another ticket is needed to review the 
documentation. I need your advice on the usage of ScribeSource.


- Denny


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6089/#review10316
---


On Aug. 14, 2012, 6:53 a.m., Denny Ye wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6089/
> ---
> 
> (Updated Aug. 14, 2012, 6:53 a.m.)
> 
> 
> Review request for Flume and Hari Shreedharan.
> 
> 
> Description
> ---
> 
> There may be others like me who want to replace a central Scribe with Flume 
> in an existing ingest system, with minimal changes for application users.
> Here is the ScribeSource, placed into the legacy folder, without deserializing. 
> 
> 
> This addresses bug https://issues.apache.org/jira/browse/FLUME-1382.
> 
> https://issues.apache.org/jira/browse/FLUME-1382
> 
> 
> Diffs
> -
> 
>   trunk/flume-ng-dist/pom.xml 1370121 
>   trunk/flume-ng-sources/flume-scribe-source/pom.xml PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/LogEntry.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ResultCode.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/Scribe.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ScribeSource.java
>  PRE-CREATION 
>   trunk/flume-ng-sources/pom.xml PRE-CREATION 
>   trunk/pom.xml 1370121 
> 
> Diff: https://reviews.apache.org/r/6089/diff/
> 
> 
> Testing
> ---
> 
> I have already used ScribeSource in a local environment and tested it over 
> the past week. It works with the existing local Scribe interface.
> 
> 
> Thanks,
> 
> Denny Ye
> 
>



Re: Review Request: Flume adopt message from existing local Scribe

2012-08-14 Thread Juhani Connolly

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6089/#review10316
---

Ship it!


Ok, I had a look over, and tested the new code and it seems fine. Other 
additions look good too

It would have been nice to have the documentation as part of the review too, 
but this issue has been in review long enough, so I might open another ticket 
later to review the documentation and add more information on usage.

- Juhani Connolly


On Aug. 14, 2012, 6:53 a.m., Denny Ye wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6089/
> ---
> 
> (Updated Aug. 14, 2012, 6:53 a.m.)
> 
> 
> Review request for Flume and Hari Shreedharan.
> 
> 
> Description
> ---
> 
> There may be others like me who want to replace a central Scribe with Flume 
> in an existing ingest system, with minimal changes for application users.
> Here is the ScribeSource, placed into the legacy folder, without deserializing. 
> 
> 
> This addresses bug https://issues.apache.org/jira/browse/FLUME-1382.
> 
> https://issues.apache.org/jira/browse/FLUME-1382
> 
> 
> Diffs
> -
> 
>   trunk/flume-ng-dist/pom.xml 1370121 
>   trunk/flume-ng-sources/flume-scribe-source/pom.xml PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/LogEntry.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ResultCode.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/Scribe.java
>  PRE-CREATION 
>   
> trunk/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ScribeSource.java
>  PRE-CREATION 
>   trunk/flume-ng-sources/pom.xml PRE-CREATION 
>   trunk/pom.xml 1370121 
> 
> Diff: https://reviews.apache.org/r/6089/diff/
> 
> 
> Testing
> ---
> 
> I have already used ScribeSource in a local environment and tested it over 
> the past week. It works with the existing local Scribe interface.
> 
> 
> Thanks,
> 
> Denny Ye
> 
>



Re: [jira] [Created] (FLUME-1479) Multiple Sinks can connect to single Channel

2012-08-14 Thread Mike Percy
Yongkun Wang,
You're welcome! Very happy to hear your thoughts.

Regards,
Mike

On Tue, Aug 14, 2012 at 8:03 PM, Wang, Yongkun | Yongkun | BDD <
yongkun.w...@mail.rakuten.com> wrote:

> Thanks Mike.
>
> This is really a nice reply based on the thorough understanding of my
> proposal.
>
> I agree that it might be a potential design change. So I will carefully
> evaluate it before submitting it to you guys to make the decision.
>
> Cheers,
> Yongkun Wang
>
> On 12/08/13 9:17, "Mike Percy"  wrote:
>
> >Hi,
> >Due to design decisions made very early on in Flume NG - specifically the
> >fact that Sink only has a simple process() method - I don't see a good way
> >to get multiple sinks pulling from the same channel in a way that is
> >backwards-compatible with the current implementation.
> >
> >Probably the "right" way to support this would be to have an interface
> >where the SinkRunner (or something outside of each Sink) is in control of
> >the transaction, and then it can easily send events to each sink serially
> >or in parallel within a single transaction. I think that is basically what
> >you are describing. If you look at SourceRunner and SourceProcessor you
> >will see similar ideas to what you are describing but they are only
> >implemented at the Source->Channel level. The current SinkProcessor is not
> >an analog of SourceProcessor, but if it was then I think that's where this
> >functionality might fit. However what happens when you do that is you have
> >to handle a ton of failure cases and threading models in a very general
> >way, which might be tough to get right for all use cases. I'm not 100%
> >sure, but I think that's why this was not pursued at the time.
> >
> >To me, this seems like a potential design change (it would have to be very
> >carefully thought out) to consider for a future major Flume code line
> >(maybe a Flume 2.x).
> >
> >By the way, if one is trying to get maximum throughput, then duplicating
> >events onto multiple channels, and having different threads running the
> >sinks (the current design) will be faster and more resilient in general
> >than a single thread and a single channel writing to multiple
> >sinks/destinations. The multiple-channel design pattern will allow
> >periodic
> >downtimes or delays on a single sink to not affect the others, assuming
> >the
> >channel sizes are large enough for buffering during downtime and assuming
> >that each sink is fast enough to recover from temporary delays. Without a
> >dedicated buffer per destination, one is at the mercy of the slowest sink
> >at every stage in the transaction.
> >
> >One last thing worth noting is that the current channels are all well
> >ordered. This means that Flume currently provides a weak ordering
> >guarantee
> >(across a single hop). That is a helpful property in the context of
> >testing
> >and validation, as well as is what many people expect if they are storing
> >logs on a single hop. I hope we don't backpedal on that weak ordering
> >guarantee without a really good reason.
> >
> >Regards,
> >Mike
> >
> >On Fri, Aug 10, 2012 at 9:30 PM, Wang, Yongkun | Yongkun | BDD <
> >yongkun.w...@mail.rakuten.com> wrote:
> >
> >> Hi Juhani,
> >>
> >> Yes, we can use two (or several) channels to fan out data to different
> >> sinks. Then we will have two channels with same data, which may not be
> >>an
> >> optimized solution. So I want to use just ONE channel, creating a
> >> processor to pull the data once from the channel, then distributing to
> >> different sinks.
> >>
> >> Regards,
> >> Yongkun Wang
> >>
> >> On 12/08/10 18:07, "Juhani Connolly" 
> >> wrote:
> >>
> >> >Hi Yongkun,
> >> >
> >> >I'm curious why you need to pull the data twice from the sink? Do you
> >> >need all sinks to have read the same amount of data? Normally for the
> >> >case of splitting data into batch and analytics, we will send data from
> >> >the source to two separate channels and have the sinks read from
> >> >separate channels.
> >> >
> >> >On 08/10/2012 02:48 PM, Wang, Yongkun | Yongkun | BDD wrote:
> >> >> Hi Denny,
> >> >>
> >> >> I am working on the patch now, it's not difficult. I have listed the
> >> >> changes in that JIRA.
> >> >> I think you misunderstand my design, I didn't maintain the order of
> >>the
> >> >> events. Instead I make sure that each sink will get the same events
> >>(or
> >> >> different events specified by selector).
> >> >>
> >> >> Suppose Channel (mc) contains the following events: 4,3,2,1
> >> >>
> >> >> If simply enable it by configuration, it may work like this:
> >> >> Sink "hsa" may get 1,3;
> >> >> Sink "hsb" may get 2,4;
> >> >> So different sink will get different data. Is this what user wants?
> >> >>
> >> >>
> >> >> In my design, "hsa" and "hsb" will both get "4,3,2,1". This is a
> >>typical
> >> >> case when user want to fan-out the data into two places (eg. One for
> >> >>batch
> >> >> and and another for real-time analysis).
> >> >>
> >> >> Regards,
> >> >> Yongkun Wang
> >> >>
> >> >

[jira] [Created] (FLUME-1484) Flume support throughput in Agent, Source, Sink level at JMX

2012-08-14 Thread Denny Ye (JIRA)
Denny Ye created FLUME-1484:
---

 Summary: Flume support throughput in Agent, Source, Sink level at 
JMX
 Key: FLUME-1484
 URL: https://issues.apache.org/jira/browse/FLUME-1484
 Project: Flume
  Issue Type: Improvement
  Components: Node, Sinks+Sources
Affects Versions: v1.2.0
Reporter: Denny Ye


From a user's point of view, we would like to know the current throughput from 
one of the monitoring tools. A WebUI is best, of course; JMX is a simple way to 
implement throughput monitoring. 

Agent should have input and output throughput based on several Sources and 
Sinks.

Here is some simple code from my environment for monitoring the throughput of a Source.
{code:title=ThroughputCounter.java|borderStyle=solid}
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.flume.instrumentation.SourceCounter;

public class ThroughputCounter {
  private volatile boolean isRunning;
  private final AtomicInteger cache = new AtomicInteger();
  private final SourceCounter sourceCounter;
  private Counter counter;

  public ThroughputCounter(SourceCounter sourceCounter) {
    this.sourceCounter = sourceCounter;
  }

  public void start() {
    isRunning = true;
    counter = new Counter();
    counter.start();
  }

  public void stop() {
    isRunning = false;
    if (counter != null) {
      counter.interrupt(); // wake the thread if it is sleeping
    }
  }

  /** Called by the source whenever it writes bytes. */
  public void addWriteBytes(int bytes) {
    cache.getAndAdd(bytes);
  }

  /** Flushes the accumulated byte count once per second. */
  private class Counter extends Thread {

    Counter() {
      super("ThroughputCounterThread");
      setDaemon(true);
    }

    @Override
    public void run() {
      while (isRunning) {
        try {
          Thread.sleep(1000);
          // proposed new method on SourceCounter for throughput reporting
          sourceCounter.incrementSourceThroughput(cache.getAndSet(0));
        } catch (InterruptedException e) {
          // stop() interrupts us; the loop condition ends the thread
        }
      }
    }
  }
}
{code} 





[jira] [Commented] (FLUME-1435) Proposal of Transactional Multiplex (fan out) Sink

2012-08-14 Thread Yongkun Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434783#comment-13434783
 ] 

Yongkun Wang commented on FLUME-1435:
-

Nice comment from Mike.

On 12/08/13 9:17, "Mike Percy"  wrote:

Hi,
Due to design decisions made very early on in Flume NG - specifically the
fact that Sink only has a simple process() method - I don't see a good way
to get multiple sinks pulling from the same channel in a way that is
backwards-compatible with the current implementation.

Probably the "right" way to support this would be to have an interface
where the SinkRunner (or something outside of each Sink) is in control of
the transaction, and then it can easily send events to each sink serially
or in parallel within a single transaction. I think that is basically what
you are describing. If you look at SourceRunner and SourceProcessor you
will see similar ideas to what you are describing but they are only
implemented at the Source->Channel level. The current SinkProcessor is not
an analog of SourceProcessor, but if it was then I think that's where this
functionality might fit. However what happens when you do that is you have
to handle a ton of failure cases and threading models in a very general
way, which might be tough to get right for all use cases. I'm not 100%
sure, but I think that's why this was not pursued at the time.

To me, this seems like a potential design change (it would have to be very
carefully thought out) to consider for a future major Flume code line
(maybe a Flume 2.x).

By the way, if one is trying to get maximum throughput, then duplicating
events onto multiple channels, and having different threads running the
sinks (the current design) will be faster and more resilient in general
than a single thread and a single channel writing to multiple
sinks/destinations. The multiple-channel design pattern will allow
periodic
downtimes or delays on a single sink to not affect the others, assuming
the
channel sizes are large enough for buffering during downtime and assuming
that each sink is fast enough to recover from temporary delays. Without a
dedicated buffer per destination, one is at the mercy of the slowest sink
at every stage in the transaction.

One last thing worth noting is that the current channels are all well
ordered. This means that Flume currently provides a weak ordering
guarantee
(across a single hop). That is a helpful property in the context of
testing
and validation, as well as is what many people expect if they are storing
logs on a single hop. I hope we don't backpedal on that weak ordering
guarantee without a really good reason.

Regards,
Mike
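
As a toy illustration of the runner-owns-the-transaction idea described above 
(illustrative names only, not Flume APIs): a runner takes a batch from a single 
channel and delivers each event to every sink serially, rolling the whole batch 
back on failure so no sink ends up ahead of the others.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

public class ReplicatingRunner {
  private final Deque<String> channel = new ArrayDeque<>();
  private final List<Consumer<String>> sinks = new ArrayList<>();

  public void put(String event) { channel.addLast(event); }
  public void addSink(Consumer<String> sink) { sinks.add(sink); }

  // One "transaction": take a batch, deliver to all sinks; on any failure,
  // return the whole batch to the channel so all sinks stay in step.
  public void process(int batchSize) {
    List<String> batch = new ArrayList<>();
    for (int i = 0; i < batchSize && !channel.isEmpty(); i++) {
      batch.add(channel.pollFirst());
    }
    try {
      for (String event : batch) {
        for (Consumer<String> sink : sinks) {
          sink.accept(event);           // serial delivery within the txn
        }
      }
    } catch (RuntimeException e) {
      // rollback: push the batch back to the head of the channel, in order
      for (int i = batch.size() - 1; i >= 0; i--) {
        channel.addFirst(batch.get(i));
      }
      throw e;
    }
  }

  public static void main(String[] args) {
    ReplicatingRunner r = new ReplicatingRunner();
    List<String> a = new ArrayList<>();
    List<String> b = new ArrayList<>();
    r.addSink(a::add);
    r.addSink(b::add);
    for (String e : new String[] {"4", "3", "2", "1"}) r.put(e);
    r.process(4);
    System.out.println(a.equals(b));   // true: both sinks saw 4,3,2,1
  }
}
```

This sidesteps the per-sink process() contract, which is exactly why it would 
require the kind of careful failure-case and threading design Mike describes.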

> Proposal of Transactional Multiplex (fan out) Sink
> --
>
> Key: FLUME-1435
> URL: https://issues.apache.org/jira/browse/FLUME-1435
> Project: Flume
>  Issue Type: New Feature
>  Components: Channel, Sinks+Sources
>Affects Versions: v1.2.0
>Reporter: Yongkun Wang
>Assignee: Yongkun Wang
>  Labels: features
>
> Hi, 
> I have proposed this design by email several weeks ago. I received comment 
> from Brock. I guess your guys are very busy, so I think I'd better create 
> this JIRA, and put slides and patch here to explain it more clearly.
> Regards,
> Yongkun
> Following is the design from previous email, I will attach slides later.
> From: "Wang, Yongkun" 
> Date: Wed, 25 Jul 2012 10:32:31 GMT
> To: "dev@flume.apache.org" 
> Cc: "u...@flume.apache.org" 
> Subject: Transactional Multiplex (fan out) Sink
> Hi,
> In our system, we need to fan out the aggregated flow to several 
> destinations. Usually the flow to each destination is identical. 
> There is a nice feature of NG, the "multiplexing flow", which can satisfy our 
> requirements. It is implemented by using separated channels, which is easy to 
> do transaction control.
> But in our case, the fan out is replicating in most cases. If using the 
> current "Replicating to Channels" configuration, we will get several 
> identical channels on the same host, which may consume a large amount of 
> resources (memory, disk, etc.). The performance may possibly drop, and the 
> events to each destination may not be synchronized.
> I read NG source, I think I could move the multiplex from Channel to Sink, 
> that is, using single Channel, fan out to different Sinks, which may solve 
> the problems (resource usage, performance, event synchronization) of multiple 
> Channels.
> I see that there are "LoadBalancingSinkProcessor" and "SinkSelector" classes, 
> but they cannot be used to achieve the target of replicating events from one 
> Channel to different Sinks.
> The following is an optional implementation of the Transactional Multiplex 
> (or fan out) Sink:
> 1. Add a Transactional Multiplex Sink Processor, which will

Re: [jira] [Created] (FLUME-1479) Multiple Sinks can connect to single Channel

2012-08-14 Thread Wang, Yongkun | Yongkun | BDD
Thanks Mike.

This is really a nice reply based on the thorough understanding of my
proposal.

I agree that it might be a potential design change. So I will carefully
evaluate it before submitting it to you guys to make the decision.

Cheers,
Yongkun Wang

On 12/08/13 9:17, "Mike Percy"  wrote:

>Hi,
>Due to design decisions made very early on in Flume NG - specifically the
>fact that Sink only has a simple process() method - I don't see a good way
>to get multiple sinks pulling from the same channel in a way that is
>backwards-compatible with the current implementation.
>
>Probably the "right" way to support this would be to have an interface
>where the SinkRunner (or something outside of each Sink) is in control of
>the transaction, and then it can easily send events to each sink serially
>or in parallel within a single transaction. I think that is basically what
>you are describing. If you look at SourceRunner and SourceProcessor you
>will see similar ideas to what you are describing but they are only
>implemented at the Source->Channel level. The current SinkProcessor is not
>an analog of SourceProcessor, but if it was then I think that's where this
>functionality might fit. However what happens when you do that is you have
>to handle a ton of failure cases and threading models in a very general
>way, which might be tough to get right for all use cases. I'm not 100%
>sure, but I think that's why this was not pursued at the time.
>
>To me, this seems like a potential design change (it would have to be very
>carefully thought out) to consider for a future major Flume code line
>(maybe a Flume 2.x).
>
>By the way, if one is trying to get maximum throughput, then duplicating
>events onto multiple channels, and having different threads running the
>sinks (the current design) will be faster and more resilient in general
>than a single thread and a single channel writing to multiple
>sinks/destinations. The multiple-channel design pattern will allow
>periodic
>downtimes or delays on a single sink to not affect the others, assuming
>the
>channel sizes are large enough for buffering during downtime and assuming
>that each sink is fast enough to recover from temporary delays. Without a
>dedicated buffer per destination, one is at the mercy of the slowest sink
>at every stage in the transaction.
>
>One last thing worth noting is that the current channels are all well
>ordered. This means that Flume currently provides a weak ordering
>guarantee
>(across a single hop). That is a helpful property in the context of
>testing
>and validation, as well as is what many people expect if they are storing
>logs on a single hop. I hope we don't backpedal on that weak ordering
>guarantee without a really good reason.
>
>Regards,
>Mike
>
>On Fri, Aug 10, 2012 at 9:30 PM, Wang, Yongkun | Yongkun | BDD <
>yongkun.w...@mail.rakuten.com> wrote:
>
>> Hi Juhani,
>>
>> Yes, we can use two (or several) channels to fan out data to different
>> sinks. Then we will have two channels with same data, which may not be
>>an
>> optimized solution. So I want to use just ONE channel, creating a
>> processor to pull the data once from the channel, then distributing to
>> different sinks.
>>
>> Regards,
>> Yongkun Wang
>>
>> On 12/08/10 18:07, "Juhani Connolly" 
>> wrote:
>>
>> >Hi Yongkun,
>> >
>> >I'm curious why you need to pull the data twice from the sink? Do you
>> >need all sinks to have read the same amount of data? Normally for the
>> >case of splitting data into batch and analytics, we will send data from
>> >the source to two separate channels and have the sinks read from
>> >separate channels.
>> >
>> >On 08/10/2012 02:48 PM, Wang, Yongkun | Yongkun | BDD wrote:
>> >> Hi Denny,
>> >>
>> >> I am working on the patch now, it's not difficult. I have listed the
>> >> changes in that JIRA.
>> >> I think you misunderstand my design, I didn't maintain the order of
>>the
>> >> events. Instead I make sure that each sink will get the same events
>>(or
>> >> different events specified by selector).
>> >>
>> >> Suppose Channel (mc) contains the following events: 4,3,2,1
>> >>
>> >> If simply enable it by configuration, it may work like this:
>> >> Sink "hsa" may get 1,3;
>> >> Sink "hsb" may get 2,4;
>> >> So different sink will get different data. Is this what user wants?
>> >>
>> >>
>> >> In my design, "hsa" and "hsb" will both get "4,3,2,1". This is a
>>typical
>> >> case when user want to fan-out the data into two places (eg. One for
>> >>batch
>> >> and and another for real-time analysis).
>> >>
>> >> Regards,
>> >> Yongkun Wang
>> >>
>> >>
>> >> On 12/08/10 14:29, "Denny Ye"  wrote:
>> >>
>> >>> hi Yongkun,
>> >>>
>> >>>JIRA can be accessed now.
>> >>>
>> >>>I think it might be difficult to understand the order of events
>>from
>> >>> your thought. If we don't care about the order, can discuss the
>>value
>> >>>and
>> >>> feasibility.  In my opinion, data ingest flow is order unawareness,
>>at
>> >>> least, not such important for us.

Flume builds back online

2012-08-14 Thread Arvind Prabhakar
The Flume builds were previously disabled due to the repository change.
Updating the configuration and restricting the job to nodes that have Git
support seems to have worked:

https://builds.apache.org/job/flume-trunk/281/

I also took the liberty of enabling email notifications, but to minimize the
overall mail generated, I reduced the frequency from hourly to daily.

Regards,
Arvind Prabhakar


Re: Review Request: FLUME-1425: Create a SpoolDirectory Source and Client

2012-08-14 Thread Patrick Wendell

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6377/
---

(Updated Aug. 14, 2012, 10:02 p.m.)


Review request for Flume.


Changes
---

Updating description.


Description (updated)
---

This patch adds a spooling-directory-based source. The idea is that a user can 
have a spool directory where files are deposited for ingestion into Flume. Once 
ingested, the files are renamed to mark them as completed, and the implementation 
guarantees at-least-once delivery semantics similar to those achieved within 
Flume itself, even across failures and restarts of the JVM running the code.

This helps fill the gap for people who want a way to get reliable delivery of 
events into flume, but don't want to directly write their application against 
the flume API. They can simply drop log files off in a spooldir and let flume 
ingest asynchronously (using some shell scripts or other automated process).

Unlike the prior iteration, this patch implements a first-class source. It also 
extends the avro client to support spooling in a similar manner.


This addresses bug FLUME-1425.
https://issues.apache.org/jira/browse/FLUME-1425


Diffs
-

  
flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceConfiguration.java
 da804d7 
  
flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceType.java
 abbbf1c 
  flume-ng-core/src/main/java/org/apache/flume/client/avro/AvroCLIClient.java 
4a5ecae 
  
flume-ng-core/src/main/java/org/apache/flume/client/avro/BufferedLineReader.java
 PRE-CREATION 
  flume-ng-core/src/main/java/org/apache/flume/client/avro/LineReader.java 
PRE-CREATION 
  
flume-ng-core/src/main/java/org/apache/flume/client/avro/SpoolingFileLineReader.java
 PRE-CREATION 
  flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySource.java 
PRE-CREATION 
  
flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySourceConfigurationConstants.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/client/avro/TestBufferedLineReader.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/client/avro/TestSpoolingFileLineReader.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/source/TestSpoolDirectorySource.java
 PRE-CREATION 
  flume-ng-doc/sphinx/FlumeUserGuide.rst 45dd7cc 

Diff: https://reviews.apache.org/r/6377/diff/


Testing
---

Extensive unit tests; I also built and played with this using a stub Flume 
agent. If you look at the JIRA, I have attached a configuration file for an agent 
that will print Avro events to the command line - that's helpful when testing 
this.


Thanks,

Patrick Wendell



[jira] [Updated] (FLUME-1425) Create a SpoolDirectory Source and Client

2012-08-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated FLUME-1425:
---

Attachment: FLUME-1425.v5.patch.txt

This patch is ready for review. It both creates a new source 
(SpoolDirectorySource) and adds a spooling-directory capability to the existing 
avro client.

Includes extensive unit tests - probably best to start with those.

> Create a SpoolDirectory Source and Client
> -
>
> Key: FLUME-1425
> URL: https://issues.apache.org/jira/browse/FLUME-1425
> Project: Flume
>  Issue Type: Improvement
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
> Attachments: FLUME-1425.avro-conf-file.txt, FLUME-1425.patch.v1.txt, 
> FLUME-1425.v5.patch.txt
>
>
> The proposal is to create a small executable client which reads logs from a 
> spooling directory and sends them to a flume sink, then performs cleanup on 
> the directory (either by deleting or moving the logs). It would make the 
> following assumptions
> - Files placed in the directory are uniquely named
> - Files placed in the directory are immutable
> The problem this is trying to solve is that there is currently no way to do 
> guaranteed event delivery across flume agent restarts when the data is being 
> collected through an asynchronous source (and not directly from the client 
> API). Say, for instance, you are using an exec("tail -F") source. If the agent 
> restarts due to error or intentionally, tail may pick up at a new location 
> and you lose the intermediate data.
> At the same time, there are users who want at-least-once semantics, and 
> expect those to apply as soon as the data is written to disk from the initial 
> logger process (e.g. apache logs), not just once it has reached a flume 
> agent. This idea would bridge that gap, assuming the user is able to copy 
> immutable logs to a spooling directory through a cron script or something.
> The basic internal logic of such a client would be as follows:
> - Scan the directory for files
> - Chose a file and read through, while sending events to an agent
> - Close the file and delete it (or rename, or otherwise mark completed)
> That's about it. We could add sync-points to make recovery more efficient in 
> the case of failure.
> A key question is whether this should be implemented as a standalone client 
> or as a source. My instinct is actually to do this as a source, but there 
> could be some benefit to not requiring an entire agent in order to run this, 
> specifically that it would become platform independent and you could stick it 
> on Windows machines. Others I have talked to have also sided on a standalone 
> executable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Review Request: FLUME-1425: Create a SpoolDirectory Source and Client

2012-08-14 Thread Patrick Wendell

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6377/
---

(Updated Aug. 14, 2012, 9:54 p.m.)


Review request for Flume.


Changes
---

Updating title to change scope and adding whitespace fixes.


Summary (updated)
-

FLUME-1425: Create a SpoolDirectory Source and Client


Description
---

This patch extends the existing avro client to support a file-based spooling 
mechanism. See in-line documentation for precise details, but the basic idea is 
that a user can have a spool directory where files are deposited for ingestion 
into flume. Once ingested, the files are clearly renamed and the implementation 
guarantees at-least-once delivery semantics similar to those achieved within 
flume itself, even across failures and restarts of the JVM running the code.

I feel vaguely uneasy about building this as part of the standalone avro client 
rather than as its own source. An alternative would be to build this as a 
proper source (in fact, there are some ad-hoc transaction semantics used here 
which would really be a better fit for a source). Interested in hearing 
feedback on that as well. The benefit of having this in the avro client is that 
you don't need the flume runner scripts which are not windows compatible.


This addresses bug FLUME-1425.
https://issues.apache.org/jira/browse/FLUME-1425


Diffs (updated)
-

  
flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceConfiguration.java
 da804d7 
  
flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceType.java
 abbbf1c 
  flume-ng-core/src/main/java/org/apache/flume/client/avro/AvroCLIClient.java 
4a5ecae 
  
flume-ng-core/src/main/java/org/apache/flume/client/avro/BufferedLineReader.java
 PRE-CREATION 
  flume-ng-core/src/main/java/org/apache/flume/client/avro/LineReader.java 
PRE-CREATION 
  
flume-ng-core/src/main/java/org/apache/flume/client/avro/SpoolingFileLineReader.java
 PRE-CREATION 
  flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySource.java 
PRE-CREATION 
  
flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySourceConfigurationConstants.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/client/avro/TestBufferedLineReader.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/client/avro/TestSpoolingFileLineReader.java
 PRE-CREATION 
  
flume-ng-core/src/test/java/org/apache/flume/source/TestSpoolDirectorySource.java
 PRE-CREATION 
  flume-ng-doc/sphinx/FlumeUserGuide.rst 45dd7cc 

Diff: https://reviews.apache.org/r/6377/diff/


Testing
---

Extensive unit tests and I also built and played with this using a stub flume 
agent. If you look at the JIRA I have a configuration file for an agent that 
will print out Avro events to the command line - that's helpful when testing 
this.


Thanks,

Patrick Wendell



Re: Transforming 1 event to n events

2012-08-14 Thread Jeremy Custenborder
Hi Mike,

I think I'm still blocked on this, or I'll have to move the splitting
of the data up to the source, which I know will work for sure. I've
just been trying to avoid it because I didn't want to deploy this to
all of the web servers.

I'm looking into the EventSerializer and I don't think it's going to
work for me either. All of the examples I've seen so far write data to
an output stream that seems to be the raw data file. It looks like
append is only called once per event. This prevents me from writing
multiple events as separate records in the SequenceFile on HDFS.
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java#L72

Am I off base here?
J

On Mon, Aug 13, 2012 at 8:59 PM, Mike Percy  wrote:
> On Mon, Aug 13, 2012 at 3:34 PM, Jeremy Custenborder <
> jcustenbor...@gmail.com> wrote:
>
>> I need to have the multiple objects available to
>> hive. The upstream object is actually a protobuf with hierarchy. I was
>> planning on flattening the object for hive. Here is an example of what
>> I'm collecting. The actual protobuf has many more fields, but this
>> gives you an idea.
>>
>> requestid
>> page
>> timestamp
>> useragent
>> impressions =[12345, 43212,12344,12345,43122, etc]
>>
>> transforming for each impression.
>>
>> requestid
>> page
>> timestamp
>> useragent
>> index
>> objectid
>>
>> This gives me one row in hive per impression. This might be a little
>> more contextual. I picked the earlier example because I didn't want to
>> get caught up in my use case.  I could move this code to serializers
>> but I need to do similar logic twice since I'm incrementing a counter
>> in HBase per impression and adding a row per impression in HDFS (Hive).
>> If I transformed the event to multiple events earlier in the pipe, I
>> would only have to write code to generate keys per event. At this
>> point I'm going to implement two serializers. One to handle hdfs and
>> one for hbase.
>>
>
> Hi Jeremy,
>
> Thanks for the extra color. It's an interesting flow. As more people
> continue to adopt Flume, I think we'll start to see patterns where the
> design or implementation of Flume is lacking and we can work towards
> bridging those gaps, and your use case provides valuable data on that. As
> for where we are now, I'm happy to hear that you have found a way forward.
>
> If you can keep us apprised as things progress with your Flume deployment I
> would love to hear about it!
>
> Regards,
> Mike
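The flattening Jeremy describes in this thread (one request record fanned out into one row per impression) can be sketched independently of any Flume interface. The field names follow his example; the class and the tab-separated row format are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of flattening one "request" record carrying a list of impressions
// into one flat record per impression, each of which would become one Hive row.
public class ImpressionFlattener {
    public static List<String> flatten(String requestId, String page,
                                       long timestamp, List<Long> impressions) {
        List<String> rows = new ArrayList<>();
        int index = 0;
        for (long objectId : impressions) {
            // One output row per impression: requestid, page, timestamp,
            // index, objectid, tab-separated.
            rows.add(requestId + "\t" + page + "\t" + timestamp
                     + "\t" + index++ + "\t" + objectId);
        }
        return rows;
    }
}
```

Whether this logic lives in the source, an interceptor, or two serializers is exactly the design question the thread is debating; the splitting itself is the same in each case.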


WriteableEvent and WriteableEventKey

2012-08-14 Thread Jeremy Custenborder
In flume-og there was a sink called seqfile which wrote to HDFS using
a key of WriteableEvent and WriteableEventKey. Is there an equivalent
in flume-ng? I currently have a few terabytes of data stored in this
format. I'd prefer to not have to move all of it to a different
format.

Thanks!
j


[jira] [Updated] (FLUME-1425) Create a SpoolDirectory Source and Client

2012-08-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated FLUME-1425:
---

Summary: Create a SpoolDirectory Source and Client  (was: Create Standalone 
Spooling Client)

> Create a SpoolDirectory Source and Client
> -
>
> Key: FLUME-1425
> URL: https://issues.apache.org/jira/browse/FLUME-1425
> Project: Flume
>  Issue Type: Improvement
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
> Attachments: FLUME-1425.avro-conf-file.txt, FLUME-1425.patch.v1.txt
>
>
> The proposal is to create a small executable client which reads logs from a 
> spooling directory and sends them to a flume sink, then performs cleanup on 
> the directory (either by deleting or moving the logs). It would make the 
> following assumptions:
> - Files placed in the directory are uniquely named
> - Files placed in the directory are immutable
> The problem this is trying to solve is that there is currently no way to do 
> guaranteed event delivery across flume agent restarts when the data is being 
> collected through an asynchronous source (and not directly from the client 
> API). Say, for instance, you are using an exec("tail -F") source. If the agent 
> restarts due to error or intentionally, tail may pick up at a new location 
> and you lose the intermediate data.
> At the same time, there are users who want at-least-once semantics, and 
> expect those to apply as soon as the data is written to disk from the initial 
> logger process (e.g. apache logs), not just once it has reached a flume 
> agent. This idea would bridge that gap, assuming the user is able to copy 
> immutable logs to a spooling directory through a cron script or something.
> The basic internal logic of such a client would be as follows:
> - Scan the directory for files
> - Choose a file and read through, while sending events to an agent
> - Close the file and delete it (or rename, or otherwise mark completed)
> That's about it. We could add sync-points to make recovery more efficient in 
> the case of failure.
> A key question is whether this should be implemented as a standalone client 
> or as a source. My instinct is actually to do this as a source, but there 
> could be some benefit to not requiring an entire agent in order to run this, 
> specifically that it would become platform independent and you could stick it 
> on Windows machines. Others I have talked to have also sided on a standalone 
> executable.





Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Patrick Wendell
We definitely need something like this, given that extensibility is
the expectation in Flume (not the exception).

It's likely that the flume developer community will overlap with
Hadoop developer community, so using a subset of the Hadoop
annotations makes sense. @Private and @Public seem like a good fit for
what we need right now.

- Patrick

On Tue, Aug 14, 2012 at 10:29 AM, Ralph Goers
 wrote:
>
> On Aug 14, 2012, at 9:47 AM, Will McQueen wrote:
>
>> Something to consider: Eclipse uses the string 'internal' in their package
>> path to denote packages (public or otherwise) that are not intended to be
>> used as a public API:
>>
>> "All packages that are part of the platform implementation but contain no
>> API that should be exposed to ISVs are considered internal implementation
>> packages. All implementation packages should be flagged as internal, with
>> the tag occurring just after the major package name. ISVs will be told that
>> all packages marked internal are out of bounds. (A simple text search for
>> ".internal." detects suspicious reference in source files; likewise,
>> "/internal/" is suspicious in .class files). "
>
> FWIW, I also think this is a good idea.
>
>
>> [Ref:
>> http://wiki.eclipse.org/Naming_Conventions#Internal_Implementation_Packages]
>>
>> Here are some additional links on evolving Java APIs:
>> http://wiki.eclipse.org/Evolving_Java-based_APIs
>> http://wiki.eclipse.org/Evolving_Java-based_APIs_2
>> http://wiki.eclipse.org/Evolving_Java-based_APIs_3
>>
>>>> Flume configuration is actually pretty brittle
>> FLUME-1051 might address this concern.
>>
> I'm not sure how. That issue seems more about guaranteeing validity than 
> making the configuration provider more flexible.
>
>
> Ralph
>


Re: Review Request: HDFS file handle not closed properly when date bucketing

2012-08-14 Thread Yongcheng Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6578/
---

(Updated Aug. 14, 2012, 6:39 p.m.)


Review request for Flume.


Changes
---

Address Brock's two comments.


Description
---

This resolves FLUME-1350, which describes a problem where Flume's HDFS file 
handle does not close properly when date bucketing is used. The fix adds a map 
between the sink's path and its real path. Whenever a new real path is 
generated due to date bucketing, it closes the BucketWriter associated with 
the existing real path and updates the link between the sink's path and its 
real path.


This addresses bug FLUME-1350.
https://issues.apache.org/jira/browse/FLUME-1350


Diffs (updated)
-

  
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java
 fcb9642 

Diff: https://reviews.apache.org/r/6578/diff/


Testing
---

The fix has been manually tested.


Thanks,

Yongcheng Li



Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Ralph Goers

On Aug 14, 2012, at 9:47 AM, Will McQueen wrote:

> Something to consider: Eclipse uses the string 'internal' in their package
> path to denote packages (public or otherwise) that are not intended to be
> used as a public API:
> 
> "All packages that are part of the platform implementation but contain no
> API that should be exposed to ISVs are considered internal implementation
> packages. All implementation packages should be flagged as internal, with
> the tag occurring just after the major package name. ISVs will be told that
> all packages marked internal are out of bounds. (A simple text search for
> ".internal." detects suspicious reference in source files; likewise,
> "/internal/" is suspicious in .class files). "

FWIW, I also think this is a good idea.


> [Ref:
> http://wiki.eclipse.org/Naming_Conventions#Internal_Implementation_Packages]
> 
> Here are some additional links on evolving Java APIs:
> http://wiki.eclipse.org/Evolving_Java-based_APIs
> http://wiki.eclipse.org/Evolving_Java-based_APIs_2
> http://wiki.eclipse.org/Evolving_Java-based_APIs_3
> 
>>> Flume configuration is actually pretty brittle
> FLUME-1051 might address this concern.
> 
I'm not sure how. That issue seems more about guaranteeing validity than 
making the configuration provider more flexible.


Ralph



Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Ralph Goers

On Aug 14, 2012, at 9:47 AM, Will McQueen wrote:

> 
> 
>>> if you want to create a configuration using XML, JSON
> Flume is currently hardcoded to read from a Java Properties file. Please
> see Application.run():
> AbstractFileConfigurationProvider configurationProvider = new
> PropertiesFileConfigurationProvider();
> 

Yes, I know.  That is only part of it. FlumeConfiguration also expects 
properties.  PropertiesFileConfigurationProvider cannot be extended, even 
though it sort of looks like it can. All the useful methods are private.

It turns out I can't really leverage the way Flume startup works anyway. When 
the configuration changes it basically does a restart. That works fine in a 
standalone case since clients will fail over to another agent until 
reconfiguration is complete. But an embedded agent can't shutdown like that so 
I'm pretty much having to roll my own.

Ralph



Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Arun C Murthy

On Aug 14, 2012, at 2:38 AM, Mike Percy wrote:

> Right now, I feel we would get most of the bang for the buck simply by
> adding two annotations: @Public and @Internal, which to me means "you can
> subclass or instantiate this directly", or "you can't directly use this if
> you expect future compatibility".

Why not keep the same nomenclature as in Hadoop? It would help keep some sort 
of commonality in the ecosystem.

Arun

Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Will McQueen
Something to consider: Eclipse uses the string 'internal' in their package
path to denote packages (public or otherwise) that are not intended to be
used as a public API:

"All packages that are part of the platform implementation but contain no
API that should be exposed to ISVs are considered internal implementation
packages. All implementation packages should be flagged as internal, with
the tag occurring just after the major package name. ISVs will be told that
all packages marked internal are out of bounds. (A simple text search for
".internal." detects suspicious reference in source files; likewise,
"/internal/" is suspicious in .class files). "
[Ref:
http://wiki.eclipse.org/Naming_Conventions#Internal_Implementation_Packages]

Here are some additional links on evolving Java APIs:
http://wiki.eclipse.org/Evolving_Java-based_APIs
http://wiki.eclipse.org/Evolving_Java-based_APIs_2
http://wiki.eclipse.org/Evolving_Java-based_APIs_3

>> Flume configuration is actually pretty brittle
FLUME-1051 might address this concern.

>>if you want to create a configuration using XML, JSON
Flume is currently hardcoded to read from a Java Properties file. Please
see Application.run():
 AbstractFileConfigurationProvider configurationProvider = new
PropertiesFileConfigurationProvider();

Cheers,
Will

On Tue, Aug 14, 2012 at 9:14 AM, Ralph Goers wrote:

> I have a slightly different take on this.
>
> I've been trying to embed Flume within the Log4j 2 client and have found
> that things that look like they are extendable actually aren't and that
> Flume configuration is actually pretty brittle (the properties don't seem
> to be isolated to one place but percolate through several), such that if
> you want to create a configuration using XML, JSON or anything else you are
> forced to convert that into the property syntax.  I've also noticed that
> the Sources, Sinks and Channels are all defined in enums - another brittle
> construct that puts third party add-ons at a disadvantage from stuff
> packaged with Flume.
>
> I tackled a lot of these issues in Log4j 2 and came up with different
> solutions for them.  I used annotations, but not specifically to mark
> things as public or private but to identify "plugins".  For example,
> Sources, Sinks and Channels could all be annotated and be made available by
> a short name specified on the annotation which would get rid of the need
> for the enums.  Of course, these also identify components that can be used
> as models for developers and users to emulate to add their own components.
>
> > While adding annotations as guidance to programmers is OK, you can
> accomplish the same thing just by writing good Javadoc.  I'm more a fan of
> using annotations for things that are a bit more useful.
>
> Ralph
>
>
> On Aug 14, 2012, at 2:38 AM, Mike Percy wrote:
>
> > It seems we have reached a point in some of the Flume components where we
> > want to add features but that means adding new interfaces to maintain
> > backwards compatibility. While this is natural, the more we do it, and
> the
> > more we cast from interface to interface, the less readable and
> consistent
> > the codebase becomes. Also, we have exposed much of our code as public
> > class + public method, even when APIs may not be intended as stable
> > extension points, for testing or other reasons. A few years ago, Hadoop
> > faced this problem and ended up implementing annotations to document APIs
> > as @Stable/@Evolving, @Public/@Limited/@Private. See <
> > https://issues.apache.org/jira/browse/HADOOP-5073> for the history on
> that.
> >
> > I would like to propose the adoption of a similar mechanism in Flume, in
> > order to give us more wiggle room in the future for evolutionary
> > development. Thoughts?
> >
> > Right now, I feel we would get most of the bang for the buck simply by
> > adding two annotations: @Public and @Internal, which to me means "you can
> > subclass or instantiate this directly", or "you can't directly use this
> if
> > you expect future compatibility".
> >
> > Regards,
> > Mike
>
>


Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Ralph Goers
I have a slightly different take on this.

I've been trying to embed Flume within the Log4j 2 client and have found that 
things that look like they are extendable actually aren't and that Flume 
configuration is actually pretty brittle (the properties don't seem to be 
isolated to one place but percolate through several), such that if you want to 
create a configuration using XML, JSON or anything else you are forced to 
convert that into the property syntax.  I've also noticed that the Sources, 
Sinks and Channels are all defined in enums - another brittle construct that 
puts third party add-ons at a disadvantage from stuff packaged with Flume.

I tackled a lot of these issues in Log4j 2 and came up with different solutions 
for them.  I used annotations, but not specifically to mark things as public or 
private but to identify "plugins".  For example, Sources, Sinks and Channels 
could all be annotated and be made available by a short name specified on the 
annotation which would get rid of the need for the enums.  Of course, these 
also identify components that can be used as models for developers and users to 
emulate to add their own components.

While adding annotations as guidance to programmers is OK, you can accomplish 
the same thing just by writing good Javadoc.  I'm more a fan of using 
annotations for things that are a bit more useful.

Ralph


On Aug 14, 2012, at 2:38 AM, Mike Percy wrote:

> It seems we have reached a point in some of the Flume components where we
> want to add features but that means adding new interfaces to maintain
> backwards compatibility. While this is natural, the more we do it, and the
> more we cast from interface to interface, the less readable and consistent
> the codebase becomes. Also, we have exposed much of our code as public
> class + public method, even when APIs may not be intended as stable
> extension points, for testing or other reasons. A few years ago, Hadoop
> faced this problem and ended up implementing annotations to document APIs
> as @Stable/@Evolving, @Public/@Limited/@Private. See <
> https://issues.apache.org/jira/browse/HADOOP-5073> for the history on that.
> 
> I would like to propose the adoption of a similar mechanism in Flume, in
> order to give us more wiggle room in the future for evolutionary
> development. Thoughts?
> 
> Right now, I feel we would get most of the bang for the buck simply by
> adding two annotations: @Public and @Internal, which to me means "you can
> subclass or instantiate this directly", or "you can't directly use this if
> you expect future compatibility".
> 
> Regards,
> Mike



[jira] [Resolved] (FLUME-1420) Exception should be thrown if we cannot instantiate an EventSerializer

2012-08-14 Thread Brock Noland (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brock Noland resolved FLUME-1420.
-

   Resolution: Fixed
Fix Version/s: v1.3.0

Committed here 
https://git-wip-us.apache.org/repos/asf?p=flume.git;a=commit;h=e01403004af5aa5964f1df3312549c7bb9e2c686

Thank you for your contribution Patrick!

> Exception should be thrown if we cannot instantiate an EventSerializer
> -
>
> Key: FLUME-1420
> URL: https://issues.apache.org/jira/browse/FLUME-1420
> Project: Flume
>  Issue Type: Bug
>  Components: Sinks+Sources
>Affects Versions: v1.2.0
>Reporter: Brock Noland
>Assignee: Patrick Wendell
> Fix For: v1.3.0
>
> Attachments: FLUME-1420.patch.v1.txt, FLUME-1420.patch.v2.txt
>
>
> Currently EventSerializerFactory returns null if it cannot instantiate the 
> class. Then the caller NPEs because it doesn't expect null. If we cannot 
> satisfy the caller, we should throw an exception.
> {noformat}
> 2012-08-02 16:38:26,489 ERROR serialization.EventSerializerFactory: Unable to 
> instantiate Builder from 
> org.apache.flume.serialization.BodyTextEventSerializer
> 2012-08-02 16:38:26,490 WARN hdfs.HDFSEventSink: HDFS IO error
> java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:202)
> at 
> org.apache.flume.sink.hdfs.BucketWriter.access$000(BucketWriter.java:48)
> at 
> org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:155)
> at 
> org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:152)
> at 
> org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:125)
> at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:152)
> at 
> org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:307)
> at 
> org.apache.flume.sink.hdfs.HDFSEventSink$1.call(HDFSEventSink.java:717)
> at 
> org.apache.flume.sink.hdfs.HDFSEventSink$1.call(HDFSEventSink.java:714)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.flume.sink.hdfs.HDFSDataStream.open(HDFSDataStream.java:75)
> at 
> org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:188)
> ... 13 more
> {noformat}
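The behavior change the report asks for is small: fail fast instead of returning null. A self-contained sketch (the class name is invented, and EventSerializerFactory's real signature may differ; only the throw-instead-of-null pattern is the point):

```java
// Sketch of the requested change: throw a descriptive exception when a class
// cannot be instantiated, rather than logging and returning null (which
// caused the caller's NullPointerException in the stack trace above).
public class SerializerFactorySketch {
    public static Object getInstance(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Previously: log an error and return null.
            throw new IllegalStateException(
                "Unable to instantiate serializer: " + className, e);
        }
    }
}
```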





Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Jarek Jarcec Cecho
I'm also in favour of that idea.

Jarcec

On Tue, Aug 14, 2012 at 11:40:17AM +0100, Brock Noland wrote:
> Hi,
> 
> I think this makes sense. Can we clearly state and define a contract for
> Public and Internal?
> 
> Brock
> 
> On Tue, Aug 14, 2012 at 10:38 AM, Mike Percy  wrote:
> 
> > It seems we have reached a point in some of the Flume components where we
> > want to add features but that means adding new interfaces to maintain
> > backwards compatibility. While this is natural, the more we do it, and the
> > more we cast from interface to interface, the less readable and consistent
> > the codebase becomes. Also, we have exposed much of our code as public
> > class + public method, even when APIs may not be intended as stable
> > extension points, for testing or other reasons. A few years ago, Hadoop
> > faced this problem and ended up implementing annotations to document APIs
> > as @Stable/@Evolving, @Public/@Limited/@Private. See <
> > https://issues.apache.org/jira/browse/HADOOP-5073> for the history on
> > that.
> >
> > I would like to propose the adoption of a similar mechanism in Flume, in
> > order to give us more wiggle room in the future for evolutionary
> > development. Thoughts?
> >
> > Right now, I feel we would get most of the bang for the buck simply by
> > adding two annotations: @Public and @Internal, which to me means "you can
> > subclass or instantiate this directly", or "you can't directly use this if
> > you expect future compatibility".
> >
> > Regards,
> > Mike
> >
> 
> 
> 
> -- 
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/


signature.asc
Description: Digital signature


Re: [DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Brock Noland
Hi,

I think this makes sense. Can we clearly state and define a contract for
Public and Internal?

Brock

On Tue, Aug 14, 2012 at 10:38 AM, Mike Percy  wrote:

> It seems we have reached a point in some of the Flume components where we
> want to add features but that means adding new interfaces to maintain
> backwards compatibility. While this is natural, the more we do it, and the
> more we cast from interface to interface, the less readable and consistent
> the codebase becomes. Also, we have exposed much of our code as public
> class + public method, even when APIs may not be intended as stable
> extension points, for testing or other reasons. A few years ago, Hadoop
> faced this problem and ended up implementing annotations to document APIs
> as @Stable/@Evolving, @Public/@Limited/@Private. See <
> https://issues.apache.org/jira/browse/HADOOP-5073> for the history on
> that.
>
> I would like to propose the adoption of a similar mechanism in Flume, in
> order to give us more wiggle room in the future for evolutionary
> development. Thoughts?
>
> Right now, I feel we would get most of the bang for the buck simply by
> adding two annotations: @Public and @Internal, which to me means "you can
> subclass or instantiate this directly", or "you can't directly use this if
> you expect future compatibility".
>
> Regards,
> Mike
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/


[DISCUSS] Annotating interface scope/stability in Flume

2012-08-14 Thread Mike Percy
It seems we have reached a point in some of the Flume components where we
want to add features but that means adding new interfaces to maintain
backwards compatibility. While this is natural, the more we do it, and the
more we cast from interface to interface, the less readable and consistent
the codebase becomes. Also, we have exposed much of our code as public
class + public method, even when APIs may not be intended as stable
extension points, for testing or other reasons. A few years ago, Hadoop
faced this problem and ended up implementing annotations to document APIs
as @Stable/@Evolving, @Public/@Limited/@Private. See <
https://issues.apache.org/jira/browse/HADOOP-5073> for the history on that.

I would like to propose the adoption of a similar mechanism in Flume, in
order to give us more wiggle room in the future for evolutionary
development. Thoughts?

Right now, I feel we would get most of the bang for the buck simply by
adding two annotations: @Public and @Internal, which to me means "you can
subclass or instantiate this directly", or "you can't directly use this if
you expect future compatibility".

Regards,
Mike
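A minimal version of the two proposed markers could look like the following, modeled loosely on Hadoop's InterfaceAudience annotations. The nesting, package placement, and retention policy are assumptions for the sketch, not a settled design:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Sketch of the two proposed markers. Runtime retention is an assumption;
// class-file retention would also satisfy the documentation goal.
public class Annotations {
    /** Safe to subclass or instantiate directly; compatibility is maintained. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Public {}

    /** Not part of the supported API; may change without notice. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Internal {}
}
```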


[jira] [Updated] (FLUME-1479) Multiple Sinks can connect to single Channel

2012-08-14 Thread Denny Ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denny Ye updated FLUME-1479:


Attachment: FLUME-1479.patch

> Multiple Sinks can connect to single Channel
> 
>
> Key: FLUME-1479
> URL: https://issues.apache.org/jira/browse/FLUME-1479
> Project: Flume
>  Issue Type: Bug
>  Components: Configuration
>Affects Versions: v1.2.0
>Reporter: Denny Ye
>Assignee: Denny Ye
> Fix For: v1.3.0
>
> Attachments: FLUME-1479.patch
>
>
> If we have one Channel (mc) and two Sinks (hsa, hsb), they may be 
> connected to each other with a configuration like
> {quote}
> agent.sinks.hsa.channel = mc
> agent.sinks.hsb.channel = mc
> {quote}
> This means that multiple Sinks can connect to a single Channel. Normally, 
> one Sink can only connect to a single Channel.





Re: [jira] [Moved] (FLUME-1483) Implement command "show framework" end to end

2012-08-14 Thread Mike Percy
Too bad. Thanks for the info Jarcec!

Regards,
Mike

On Mon, Aug 13, 2012 at 11:49 PM, Jarek Jarcec Cecho wrote:

> The manual mentions two different ways to fix this issue (move
> issue around, run integrity check). Neither of them actually worked :-/
>
> I'm afraid that we will need to delete the problematic JIRAs and recreate
> them manually.
>
> Jarcec
>
> On Mon, Aug 13, 2012 at 10:50:50PM -0700, Mike Percy wrote:
> > Hi Jarcec,
> > Seems moving it between projects did not work?
> >
> > Regards,
> > Mike
> >
> > On Mon, Aug 13, 2012 at 1:33 AM, Jarek Jarcec Cecho  >wrote:
> >
> > > I've moved it to flume so that I can move it back to sqoop to solve
> > > missing Workflow actions:
> > >
> > >
> > >
> https://confluence.atlassian.com/display/JIRAKB/Missing+Available+Workflow+Actions
> > >
> > > Apparently I do not have enough privileges to move it back :-) Might I
> ask
> > > for higher JIRA permissions?
> > >
> > > Jarcec
> > >
> > > On Mon, Aug 13, 2012 at 07:30:37PM +1100, Jarek Jarcec Cecho (JIRA)
> wrote:
> > > >
> > > >  [
> > >
> https://issues.apache.org/jira/browse/FLUME-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> > > >
> > > > Jarek Jarcec Cecho moved SQOOP-569 to FLUME-1483:
> > > > -
> > > >
> > > > Release Note:
> > > > (Move ticket around to solve workflow issues on JIRA side:
> > > >
> > > >
> > >
> https://confluence.atlassian.com/display/JIRAKB/Missing+Available+Workflow+Actions
> > > >  Key: FLUME-1483  (was: SQOOP-569)
> > > >  Project: Flume  (was: Sqoop)
> > > >
> > > > > Implement command "show framework" end to end
> > > > > -
> > > > >
> > > > > Key: FLUME-1483
> > > > > URL:
> https://issues.apache.org/jira/browse/FLUME-1483
> > > > > Project: Flume
> > > > >  Issue Type: Task
> > > > >Reporter: Jarek Jarcec Cecho
> > > > >Assignee: Jarek Jarcec Cecho
> > > > >Priority: Minor
> > > > >
> > > > > Implement command "show framework" to show framework metadata from
> end
> > > to end.
> > > >
> > > > --
> > > > This message is automatically generated by JIRA.
> > > > If you think it was sent incorrectly, please contact your JIRA
> > > administrators:
> > >
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> > > > For more information on JIRA, see:
> > > http://www.atlassian.com/software/jira
> > > >
> > > >
> > >
>


[jira] [Updated] (FLUME-1382) Flume adopt message from existing local Scribe

2012-08-14 Thread Denny Ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denny Ye updated FLUME-1382:


Attachment: (was: FLUME-1382-doc.patch)

> Flume adopt message from existing local Scribe
> --
>
> Key: FLUME-1382
> URL: https://issues.apache.org/jira/browse/FLUME-1382
> Project: Flume
>  Issue Type: Improvement
>  Components: Sinks+Sources
>Affects Versions: v1.2.0
>Reporter: Denny Ye
>Assignee: Denny Ye
>Priority: Minor
>  Labels: scribe, thrift
> Fix For: v1.3.0
>
> Attachments: FLUME-1382-3.patch, FLUME-1382-doc-2.patch
>
>
> Currently, we are using Scribe in our data-ingest system. The central Scribe 
> is hard to maintain and upgrade. Thus, we would like to replace the central 
> Scribe with Flume and accept messages from the many existing local Scribe 
> instances. This can be treated as a legacy component.
> We have implemented a ScribeSource that uses more efficient Thrift code, 
> avoiding full deserialization.
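For context, a sketch of how such a ScribeSource might be wired into an agent so that local Scribe clients forward events to Flume. The fully qualified class name and the property names (`port`, `workerThreads`) are assumptions about the patch, not confirmed by this thread:

```properties
# Hypothetical config for the proposed ScribeSource (FLUME-1382).
# Class name and properties are assumptions, not taken from the ticket.
agent.sources = scribe
agent.channels = mc
agent.sinks = hdfs

agent.sources.scribe.type = org.apache.flume.source.scribe.ScribeSource
agent.sources.scribe.port = 1499
agent.sources.scribe.workerThreads = 5
agent.sources.scribe.channels = mc
```

With this in place, existing local Scribe daemons would only need their downstream host/port pointed at the Flume agent instead of the central Scribe.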





[jira] [Updated] (FLUME-1382) Flume adopt message from existing local Scribe

2012-08-14 Thread Denny Ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denny Ye updated FLUME-1382:


Attachment: FLUME-1382-3.patch
FLUME-1382-doc-2.patch

> Flume adopt message from existing local Scribe
> --
>
> Key: FLUME-1382
> URL: https://issues.apache.org/jira/browse/FLUME-1382
> Project: Flume
>  Issue Type: Improvement
>  Components: Sinks+Sources
>Affects Versions: v1.2.0
>Reporter: Denny Ye
>Assignee: Denny Ye
>Priority: Minor
>  Labels: scribe, thrift
> Fix For: v1.3.0
>
> Attachments: FLUME-1382-3.patch, FLUME-1382-doc-2.patch
>
>
> Currently, we are using Scribe in our data-ingest system. The central Scribe 
> is hard to maintain and upgrade. Thus, we would like to replace the central 
> Scribe with Flume and accept messages from the many existing local Scribe 
> instances. This can be treated as a legacy component.
> We have implemented a ScribeSource that uses more efficient Thrift code, 
> avoiding full deserialization.





[jira] [Updated] (FLUME-1382) Flume adopt message from existing local Scribe

2012-08-14 Thread Denny Ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLUME-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denny Ye updated FLUME-1382:


Attachment: (was: FLUME-1382.patch)

> Flume adopt message from existing local Scribe
> --
>
> Key: FLUME-1382
> URL: https://issues.apache.org/jira/browse/FLUME-1382
> Project: Flume
>  Issue Type: Improvement
>  Components: Sinks+Sources
>Affects Versions: v1.2.0
>Reporter: Denny Ye
>Assignee: Denny Ye
>Priority: Minor
>  Labels: scribe, thrift
> Fix For: v1.3.0
>
> Attachments: FLUME-1382-3.patch, FLUME-1382-doc-2.patch
>
>
> Currently, we are using Scribe in our data-ingest system. The central Scribe 
> is hard to maintain and upgrade. Thus, we would like to replace the central 
> Scribe with Flume and accept messages from the many existing local Scribe 
> instances. This can be treated as a legacy component.
> We have implemented a ScribeSource that uses more efficient Thrift code, 
> avoiding full deserialization.
