Re: Review Request 34500: SAMZA-552 Operator API change: builder and simplified operator classes

2015-05-29 Thread Yi Pan (Data Infrastructure)
> On May 29, 2015, 2:29 a.m., Milinda Pathirage wrote: > > samza-sql-core/src/main/java/org/apache/samza/sql/api/operators/OperatorRouter.java, > > line 60 > > > > > > It's better if we can discuss about the ordering g

Re: Review Request 34500: SAMZA-552 Operator API change: builder and simplified operator classes

2015-05-29 Thread Yi Pan (Data Infrastructure)
> On May 29, 2015, 1:28 a.m., Navina Ramesh wrote: > > samza-sql-core/src/main/java/org/apache/samza/sql/api/operators/OperatorSource.java, > > line 26 > > > > > > OperatorSource and OperatorSink have the same method s

Re: Review Request 34500: SAMZA-552 Operator API change: builder and simplified operator classes

2015-05-29 Thread Yi Pan (Data Infrastructure)
> On May 29, 2015, 1:28 a.m., Navina Ramesh wrote: > > samza-sql-core/src/test/java/org/apache/samza/task/sql/UserCallbacksSqlTask.java, > > line 120 > > > > > > I thought TopologyBuilder was to abstract away the spec

Re: Review Request 33280: [SAMZA-561] Basic streaming SQL query planning support

2015-05-29 Thread Milinda Pathirage
> On May 26, 2015, 7:40 a.m., Yi Pan (Data Infrastructure) wrote: > > samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/Stream.java, > > line 80 > > > > > > Question: do we have a way to specify the p

Review Request 34812: SAMZA-687 Add samza-graphite package to hello-samza

2015-05-29 Thread Gustavo Anatoly F . V . Solís
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34812/ --- Review request for samza. Repository: samza-hello-samza Description ---

Re: ProcessJobFactory parent process

2015-05-29 Thread Lukas Steiblys
Hi Yan, The memory usage is not very high, but I'm trying to cut the usage any way I can. The bigger problem is when the job crashes and the parent process stays active preventing an auto restart by the Docker supervisor. Lukas On Thursday, May 28, 2015, Yan Fang wrote: > Hi Lukas, > > The pa

Re: Review Request 34746: Adding new CoordinatorStreamMessage "SetContainerHostMapping" and LocalityManager (SAMZA-618)

2015-05-29 Thread Navina Ramesh
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34746/ --- (Updated May 29, 2015, 6:32 p.m.) Review request for samza, Chris Riccomini, Gu

Re: ProcessJobFactory parent process

2015-05-29 Thread Yan Fang
Hi Lukas, This sounds like a bug to me. Can you file a JIRA for this? We will have a look at it. Thanks, Fang, Yan yanfang...@gmail.com On Fri, May 29, 2015 at 8:39 AM, Lukas Steiblys wrote: > Hi Yan, > > The memory usage is not very high, but I'm trying to cut the usage any way > I can. > >

Re: Review Request 34500: SAMZA-552 Operator API change: builder and simplified operator classes

2015-05-29 Thread Navina Ramesh
> On May 29, 2015, 1:28 a.m., Navina Ramesh wrote: > > samza-sql-core/src/test/java/org/apache/samza/task/sql/UserCallbacksSqlTask.java, > > line 120 > > > > > > I thought TopologyBuilder was to abstract away the spec

Re: ProcessJobFactory parent process

2015-05-29 Thread Yi Pan
Hi, Lukas, I assume that when you say "the job crashes", you were referring to the child process running the container, not the parent process? If yes, we were actually talking about adding container health-check/failure-detection in the JobCoordinator. SAMZA-680 would be the good place to start t

Re: ProcessJobFactory parent process

2015-05-29 Thread Lukas Steiblys
Yes, I'm talking about the child process crashing. I'd like the parent to die as well if the child crashes so Docker can understand that the process failed and restart the container. Lukas -Original Message- From: Yi Pan Sent: Friday, May 29, 2015 12:47 PM To: dev@samza.apache.org Su

Re: Review Request 34746: Adding new CoordinatorStreamMessage "SetContainerHostMapping" and LocalityManager (SAMZA-618)

2015-05-29 Thread Yan Fang
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34746/#review85798 --- samza-core/src/main/java/org/apache/samza/coordinator/stream/Coordi

Re: ProcessJobFactory parent process

2015-05-29 Thread Yi Pan
Hi, Lukas, Yes. That's exactly part of the feature to allow health-check/failure-detection of containers. Another short-term option is trying to use ThreadJobFactory, which has the JobCoordinator and containers in the same process. Does that work for your use case? -Yi On Fri, May 29, 2015 at 12

Re: Review Request 34746: Adding new CoordinatorStreamMessage "SetContainerHostMapping" and LocalityManager (SAMZA-618)

2015-05-29 Thread Navina Ramesh
> On May 29, 2015, 8:38 p.m., Yan Fang wrote: > > samza-core/src/main/java/org/apache/samza/coordinator/stream/CoordinatorStreamMessage.java, > > line 503 > > > > > > Changing to "host" or "ip addree" is better. Becau

Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi - Let's say one day a company wants to start doing all of this awesome data integration/near-real-time stream processing stuff, so they start sending their user activity events (e.g. pageviews, ad impressions, etc) to Kafka. Then they hook up Camus to copy new events from Kafka to HDFS every ho

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Benjamin Black
Why not run a map reduce job on the data in hdfs? what is was made for. On May 29, 2015 2:13 PM, "Zach Cox" wrote: > Hi - > > Let's say one day a company wants to start doing all of this awesome data > integration/near-real-time stream processing stuff, so they start sending > their user activity

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Let's also add to the story: say the company wants to only write code for Samza, and not duplicate the same code in MapReduce jobs (or any other framework). On Fri, May 29, 2015 at 4:16 PM Benjamin Black wrote: > Why not run a map reduce job on the data in hdfs? what is was made for. > On May 29

Re: ProcessJobFactory parent process

2015-05-29 Thread Lukas Steiblys
Yes, I think switching to ThreadJobFactory is a good solution. I think the reasons why I switched to ProcessJobFactory earlier no longer hold true. Thanks. Lukas -Original Message- From: Yi Pan Sent: Friday, May 29, 2015 2:05 PM To: dev@samza.apache.org Subject: Re: ProcessJobFactory

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
(continuing from previous email) in addition to not wanting to duplicate code, say that some of the Samza jobs need to build up state, and it's important to build up this state from all of those old events no longer in Kafka. If that state was only built from the last 7 days of events, some things

RE: Reprocessing old events no longer in Kafka

2015-05-29 Thread Felix GV
We developed camus2kafka at my previous job for this purpose of re-pushing events to Kafka: https://github.com/mate1/camus2kafka Using camus2kafka, it was then possible to "stitch" the consumers (after they're done re-processing the "replay topic") back onto the regular topic at the precise off

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Navina Ramesh
Hi Zach, It sounds like you are asking for a SystemConsumer for hdfs. Does SAMZA-263 match your requirements? Thanks! Navina On 5/29/15, 2:23 PM, "Zach Cox" wrote: >(continuing from previous email) in addition to not wanting to duplicate >code, say that some of the Samza jobs need to build up

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Thomas Bernhardt
I think the application would want to replay historical events into Samza. I.e the application can replay any events older then X days from HDFS into Samza. Once Samza has processed the historical events, the application can switch input to the Kafka queue to process the more recent and finally

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Felix - that sounds like a very interesting approach! I was trying to think through how to dump the events hdfs => kafka but got stuck on the transition point, didn't even think about using the offsets from camus. I will definitely investigate further. Thanks, Zach On Fri, May 29, 2015 at 4:2

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Navina Ramesh
That said, since we don¹t yet support consuming from hdfs, one workaround would be to periodically read from hdfs and pump the data to a kafka topic (say topic A) using a hadoop / yarn based job. Then, in your Samza job, you can bootstrap from topic A and then, continue processing the latest messag

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Navina, I did see that jira and it would definitely be useful. I was thinking of maybe trying to build a composite stream, that would first read old events from hdfs and then switch over to kafka. Do you know if there has been any movement on treating hdfs as a samza stream? Thanks, Zach On

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Thomas, That definitely seems like a good approach - just need to figure out the details of consuming old events from hdfs and then seamlessly switching over to kafka for newer events. Seems like some new components of Samza need to be built to do this. Thanks, Zach On Fri, May 29, 2015 at 4

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Navina Ramesh
Hi Zach, Regarding the JIRA, it is assigned to Jakob Homan. He will be the right person to comment on that. Thanks! Navina On 5/29/15, 2:33 PM, "Zach Cox" wrote: >Hi Navina, > >I did see that jira and it would definitely be useful. I was thinking of >maybe trying to build a composite stream, th

RE: Reprocessing old events no longer in Kafka

2015-05-29 Thread Felix GV
Even if reading directly from HDFS, the matter of transitioning from re-reprocessing back to real-time is a bit problematic unless you are tapping into data ingested by Camus (or something else which has offset metadata recorded alongside the data). Granted, Kafka already has just "at least onc

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Navina, A similar approach I considered was using an infinite/very large retention period on those event kafka topics, so they would always contain all historical events. Then standard Samza reprocessing goes through all old events. I'm hesitant to pursue that though, as those topic partitions

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Felix - that is an excellent point: let's assume in this case that the Samza processing is not perfectly idempotent, so duplicating 7 days of events is not OK, but there is tolerance for duplicating several seconds/minutes of events, or handling events out-of-order for a few seconds, etc. Thank

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Navina Ramesh
Hi Zach, I agree. It is not a good idea to keep the entire set of historical data in Kafka. The retention period in Kafka does make it trickier to synchronize with your hadoop data pump. I am not very familiar with Camus2Kafka project. But that sounds like a workable solution. Ideal solution wou

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Zach Cox
Hi Navina, Do you mean bootstrapping from hdfs as in [1]? That is an interesting idea I hadn't thought of. Maybe that could be combined with the offsets stored by Camus to determine the right place to transition to the real-time kafka stream? Thanks, Zach [1] http://samza.apache.org/learn/docume

Yarn scheduling

2015-05-29 Thread Roger Hoover
Hi, I notice that when YARN schedules my jobs, it loads up one machine completely before scheduling on the next. I'm using Capacity Scheduler with a default config. Is there a way to make it "round-robin" among the available machines? Thanks, Roger

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Navina Ramesh
Hi Zach, Yes. Bootstrapping from hdfs directly is possible, provided you have a hdfs consumer defined (SAMZA-263 is a pre-requisite). :) If you have a way of getting hdfs data into Kafka, then you can use it as a bootstrap stream. However, a bootstrap stream is a temporarily privilege and always

Re: Reprocessing old events no longer in Kafka

2015-05-29 Thread Gustavo Anatoly
Hi, Zach In my humble opinion once I dont have much experience with Samza. But I think that a possible solution could be hybrid. Using HBase Coprocessor to running a Samza application to process the events and storing the results on HBase. Thus you have the history available to query it. Thanks.

Re: Review Request 34746: Adding new CoordinatorStreamMessage "SetContainerHostMapping" and LocalityManager (SAMZA-618)

2015-05-29 Thread Navina Ramesh
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34746/ --- (Updated May 29, 2015, 11:26 p.m.) Review request for samza, Chris Riccomini, G

RE: Yarn scheduling

2015-05-29 Thread Garry Turkington
Hi Roger, I've seen unbalanced container assignment across hosts but never one being maxed out before any others get any containers. So I'd look to the YARN config to start with. I believe though there will be a risk of this type of thing until YARN implements anti-affinity: https://issues.a