Re: Increase stream parallelism after reading from UnboundedSource

2016-12-06 Thread Raghu Angadi
On Sun, Dec 4, 2016 at 11:48 PM, Amit Sela wrote: > For any downstream computation, is it common for stream processors to > "fan-out/parallelise" the stream by shuffling the data into more > streams/partitions/bundles ? > I think so. It is pretty common in batch processing too.

Re: KafkaIO.read withTimestampFn

2016-11-14 Thread Raghu Angadi
Hi Kobi, Missed this earlier. Could you describe how it is limiting you? The timestampFn depends on the type of key and value. Thats why once you set it, we don't allow modifying keyCoder() or valueCoder() etc. e.g. scenario we want to avoid (if we didn't have this restriction) : KafakIO.read()

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Raghu Angadi
On Sun, Nov 13, 2016 at 10:14 PM, Davor Bonaci wrote: > Luke is bringing up great questions, I think. > Yes, better handling of 'desiredNumSplits' by a runner would be very useful. I wanted to limit my proposal to what a source like KafkaIO could do on its own. > My first impression is that th

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Raghu Angadi
o a topic the pipeline > consumes) it's reader's state is non-existing (starting from > latest/earlies), while the rest (of the readers) will pick-up where they > left. > I think this also avoids the need to "remember" the original number of > parallelism. > &

[PROPOSAL] Change to KafkaIO splits

2016-11-10 Thread Raghu Angadi
I would like to propose a change to how many splits (sources) KafkaIO creates. The code changes are relatively simple, but it has a couple of drawbacks I would to discuss here. KafkaIO currently takes '*desiredNumWorkers

Re: Timer and Window behavior

2016-11-07 Thread Raghu Angadi
On Mon, Nov 7, 2016 at 12:41 AM, Demin Alexey wrote: > How workaround I tried set withWatermarkFn2 in Instant.now() > This work around should have unblocked DirectRunner as well. I haven't looked at UnboundedReadEvaluatorFactory:142.

Re: Timer and Window behavior

2016-11-06 Thread Raghu Angadi
On Sun, Nov 6, 2016 at 4:31 AM, Demin Alexey wrote: > This is bug or incorrect using API from my side? This is a bug in KafkaIO. It should advance the watermark when there are no messages to read. https://issues.apache.org/jira/browse/BEAM-591 I want to fix it, but may not get to it until Than

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Raghu Angadi
On Wed, Oct 19, 2016 at 11:00 AM, Kenneth Knowles wrote: > I wanted to attempt to explicitly answer Raghu's question by saying that I > think Dan's starting points imply that the recommended behavior for start() > and advance() is to be "non-blocking" in the sense that they return quickly > if in

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Raghu Angadi
; > > > > wrote: > > > > > > > > > > > > > > > > > Eventually we'll be able to communicate intent with the > > runner > > > > much > > > > > > > > > more directly via the ProcessContinuation

Re: [DISCUSS] Sources and Runners

2016-10-18 Thread Raghu Angadi
de and I think they > are worth this thread. > > *Background*: > The KafkaIO waits (5 seconds) before starting to read, and (10 millis) > between advancing the reader, which is problematic for the Spark runner as > it might attempt to read (every microbatch) for a shorter period, a

Re: Jenkins build is back to stable : beam_PostCommit_MavenVerify » Apache Beam :: Examples :: Java #1503

2016-10-12 Thread Raghu Angadi
Dan, Tracis-CI fails often too. Most of the time the failure is some travis-ci related timeouts. There are quite a few of these failures for https://github.com/apache/incubator-beam/pull/1071. On Wed, Oct 12, 2016 at 11:56 AM, Dan Halperin wrote: > Just an FYI that the issues here were legitima

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-08 Thread Raghu Angadi
#x27;ll review the PR tomorrow. > > Thanks, > Amit > > On Sat, Oct 8, 2016 at 3:47 AM Raghu Angadi > wrote: > > > On Fri, Oct 7, 2016 at 4:55 PM, Amit Sela wrote: > > > > >3. Support reading of Kafka partitions that were added to topic/s > >

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-07 Thread Raghu Angadi
On Fri, Oct 7, 2016 at 4:55 PM, Amit Sela wrote: >3. Support reading of Kafka partitions that were added to topic/s while >a Pipeline reads from them - BEAM-727 > was filed. > I think this is doable (assuming some caveats about generate

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-07 Thread Raghu Angadi
Regd BEAM-704 : sent a PR https://github.com/apache/incubator-beam/pull/1071 for setting 'consumedOffset' in the reader without waiting to read the first record. On Fri, Oct 7, 2016 at 5:28 PM, Dan Halperin wrote: > >4. Reading/persisting Kafka start offsets - since Spark works in > >mic

Re: Should UnboundedSource provide a split identifier ?

2016-10-07 Thread Raghu Angadi
YARN it can run in "cluster-mode" - running the driver program in a YARN > container as well. > That's neat. I don't think Dataflow has such an option. Raghu. > > On Tue, Oct 4, 2016 at 10:39 PM Raghu Angadi > wrote: > > > On Wed, Sep 14, 2016 a

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-06 Thread Raghu Angadi
+1 for rev...@beam.incubator.apache.org. Open lists are critically important. My comment earlier was mainly about (4). Sorry about the not being clear. On Thu, Oct 6, 2016 at 11:00 AM, Lukasz Cwik wrote: > +1 for supporting different working styles. > > On Thu, Oct 6, 2016 at 10:58 AM, Kenneth

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-06 Thread Raghu Angadi
JB, Are there any examples of similar process for another Apache project? Providing regular updates of discussion happening another open list seems burdensome, especially for new contributors who come to project with large proposals. If a feature is large enough, may be a there needs to be a desi

Re: [REMINDER] Technical discussion on the mailing list

2016-10-05 Thread Raghu Angadi
+1 for AutoValue. On Wed, Oct 5, 2016 at 4:51 AM, Jean-Baptiste Onofré wrote: > So, any comment happening on a GitHub pull request, or discussion on > hangouts which can impact the project (generally speaking) has to happen on > the mailing list. > > It provides project transparency and facilita

Re: Should UnboundedSource provide a split identifier ?

2016-10-04 Thread Raghu Angadi
On Wed, Sep 14, 2016 at 1:43 PM, Amit Sela wrote: > > > > For generateInitialSplits, the UnboundedSource API doesn't require > > deterministic splitting (although it's recommended), and a PipelineRunner > > should keep track of the initially generated splits. > > > If the splitting were to be con

Re: Should UnboundedSource provide a split identifier ?

2016-10-04 Thread Raghu Angadi
On Tue, Sep 13, 2016 at 1:49 AM, Amit Sela wrote: > If I understand correctly this will break > https://github.com/apache/incubator-beam/blob/master/ > sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/ > kafka/KafkaIO.java#L857 > in > KafkaIO. > > So it's a KafkaIO limitation (for now ?) ?

Re: Should UnboundedSource provide a split identifier ?

2016-09-12 Thread Raghu Angadi
ts own splits. Can you give a concrete example? To me it looks like source and checkpoint objects are completely opaque to the runners. > On Wed, Sep 7, 2016 at 3:02 AM Raghu Angadi > wrote: > > > > If splits (UnboundedSources) had an identifier, this could be avoided, > >

Re: Should UnboundedSource provide a split identifier ?

2016-09-06 Thread Raghu Angadi
> If splits (UnboundedSources) had an identifier, this could be avoided, and checkpoints could be persisted accordingly. The order of the splits that a source returns is preserved. So during an update, you can expect 5th split gets invoked with the same checkpoint mark that 5th split saved before

Re: KafkaIO Windowing Fn

2016-09-02 Thread Raghu Angadi
On Tue, Aug 30, 2016 at 12:01 AM, Chawla,Sumit wrote: > Sorry i tried with DirectRunner but ran into some kafka issues. Following > is the snippet i am working on, and will post more details once i get it > working ( as of now i am unable to read messages from Kafka using > DirectRunner) > woul

Re: KafkaIO Windowing Fn

2016-08-25 Thread Raghu Angadi
On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh wrote: > > We should still have a JIRA to improve the KafkaIO watermark tracking in > the absence of new records . > filed https://issues.apache.org/jira/browse/BEAM-591 I don't want to hijack this thread Sumit's primary issue, but want to mention re

Re: KafkaIO Windowing Fn

2016-08-24 Thread Raghu Angadi
The default implementation returns processing timestamp of the last record (in effect. more accurately it returns same as getTimestamp(), which might overridden by user). As a work around, yes, you can provide your own watermarkFn that essentially returns Now() or Now()-1sec. (usage in javadoc

Re: java.io.NotSerializableException: org.apache.kafka.common.TopicPartition

2016-08-21 Thread Raghu Angadi
Thanks Sumit. We should probably update our kafka dependency to make sure 0.9.0.0 is excluded. On Sun, Aug 21, 2016 at 5:17 PM, Chawla,Sumit wrote: > Thanks Dan.. today i was able to identify what the issue was. Kafka > TopicPartition is marked Serializable in kafka-clients-0.9.0.1.jar. > Some

Re: Suggestion for Writing Sink Implementation

2016-07-29 Thread Raghu Angadi
On Fri, Jul 29, 2016 at 12:56 PM, Dan Halperin wrote: > > BEAM-452 and https://github.com/apache/incubator-beam/pull/690 > > Raghu, do you see this cache necessary once that work is in? nope. I didn't realize the feature is close to be merged. Thanks!

Re: Suggestion for Writing Sink Implementation

2016-07-29 Thread Raghu Angadi
gt; Sumit Chawla > > > On Fri, Jul 29, 2016 at 10:45 AM, Raghu Angadi > > wrote: > > > It is the preferred pattern I think. Is your source bounded or unbounded > > (i.e. streaming)? If it is latter, your sink could even be simpler than > > JB's. e.g. Kafka

Re: Suggestion for Writing Sink Implementation

2016-07-29 Thread Raghu Angadi
It is the preferred pattern I think. Is your source bounded or unbounded (i.e. streaming)? If it is latter, your sink could even be simpler than JB's. e.g. KafkaIO.write() where it just writes the messages to Kafka in processElement(). The pros are pretty clear : runner independent, pure Beam, sim

Re: Adding DoFn Setup and Teardown methods

2016-06-28 Thread Raghu Angadi
This is terrific! Thanks for the proposal. On Tue, Jun 28, 2016 at 9:06 AM, Thomas Groh wrote: > Hey Everyone: > > We've recently started to be permitted to reuse DoFn instances in Beam[1]. > Beyond the efficiency gains from not having to deserialize new DoFn > instances for every bundle, DoFn r

Re: Scala DSL

2016-06-26 Thread Raghu Angadi
On Fri, Jun 24, 2016 at 7:05 PM, Dan Halperin wrote: > > I love the > > name scio. But I think sdks/scala might be most appropriate and would > make > > it a first class citizen for Beam. > > > > I am strongly against it being in the 'sdks/' top-level module -- it's not > a Beam SDK. Unlike DSL,

Re: Scala DSL

2016-06-24 Thread Raghu Angadi
DSL is a pretty generic term.. The fact that scio uses Java SDK is an implementation detail. I love the name scio. But I think sdks/scala might be most appropriate and would make it a first class citizen for Beam. Where would a future python sdk reside? On Fri, Jun 24, 2016 at 1:50 PM, Jean-Bapt

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:39 AM, Ben Chambers wrote: > > To clarify -- this case is actually not allowed by the beam model. The > guarantee is that either a bundle is successfully completed (startBundle, > processElement*, finishBundle, commit) or not. If it isn't, then the bundle > is reprocesse

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:15 AM, Dan Halperin wrote: > > I thought finishBundle() > > exists simply as best-effort indication from the runner to user some > chunk > > of records have been processed.. not part of processing guarantees. Also > > the term "bundle" itself is fairly loosely defined (m

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:13 AM, Ben Chambers wrote: > - If failure occurs after finishBundle() but before the consumption is > committed, then the bundle may be reprocessed, which leads to duplicated > calls to processElement() and finishBundle(). > > - If failure occurs after consumption is

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:05 AM, Raghu Angadi wrote: > > I thought finishBundle() exists simply as best-effort indication from the > runner to user some chunk of records have been processed.. also to help with DoFn's own clean up if there is any.

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
Such data loss can still occur if the worker dies after finishBundle() returns, but before the consumption is committed. I thought finishBundle() exists simply as best-effort indication from the runner to user some chunk of records have been processed.. not part of processing guarantees. Also the t

Re: [DISCUSS] Beam IO &runners native IO

2016-05-03 Thread Raghu Angadi
y, that is the only one. Still, checkpointing sinks in Beam > could be useful for users who are concerned about exactly once > semantics. I'm not sure whether we can implement something similar > with the bundle mechanism. > > On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi >

Re: [DISCUSS] Beam IO &runners native IO

2016-05-02 Thread Raghu Angadi
What are good examples of streaming sinks that support checkpointing (or transactions/rollbacks)? I don't Kafka supports a rollback. On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels wrote: > Yes, I would expect sinks to provide similar additional interfaces > like sources, e.g. checkpointing.

Re: [DISCUSS] Beam IO &runners native IO

2016-04-29 Thread Raghu Angadi
On Fri, Apr 29, 2016 at 2:11 AM, Maximilian Michels wrote: > Further, the KafkaIO enforces a data model which AFAIK is > not enforced by the Beam model. I don't know the details for this > design decision but I would like this to be communicated before it is > merged into the master. > structu

Re: [DISCUSS] Beam IO &runners native IO

2016-04-29 Thread Raghu Angadi
I agree with the sentiment here. Please note that runner specific sources continue to work as they do now. The original question was about '.useNative()' which requires generic to Beam sources to interact with the specific routers (on those lines). On Fri, Apr 29, 2016 at 2:11 AM, Maximilian Miche

Re: [DISCUSS] Beam IO &runners native IO

2016-04-28 Thread Raghu Angadi
Amir, KafkaIO is in this jar : https://repository.apache.org/content/groups/snapshots/org/apache/beam/kafka/0.1.0-incubating-SNAPSHOT/kafka-0.1.0-incubating-20160428.071230-8.jar how are building your app? How are fetching dependencies? On Thu, Apr 28, 2016 at 7:51 PM, amir bahmanyari < amirto..

Re: [DISCUSS] Beam IO &runners native IO

2016-04-28 Thread Raghu Angadi
Thanks JB. Amir, Only requirement with Beam KafkaIO is that it requires Kafka 0.9.x (already in production). It is not compatible with older kafka version. Raghu. On Thu, Apr 28, 2016 at 10:40 AM, Jean-Baptiste Onofré wrote: > Hi Amir, > > Now, we have a KafkaIO in Beam (both source and sink)

Re: [DISCUSS] Beam IO &runners native IO

2016-04-28 Thread Raghu Angadi
It would also be good to see examples of native I/O being noticeably more performant. The CassandraIO and PubsubIO are examples of corresponding Beam sources missing rather than Beam sources being slow. I think it is better to look into why Beam IO can not be performant. I think in most cases it c

Re: [PROPOSAL] New sdk languages

2016-03-24 Thread Raghu Angadi
I would love to see Scala API properly supported. I didn't know about scio. Scala is such a natural fit for Dataflow API. I am not sure of the policy w.r.t where such packages would live in Beam repo, but I personally would write my Dataflow applications in Scala. It is probably already the case b