[jira] [Commented] (BEAM-2185) KafkaIO bounded source
[ https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014439#comment-16014439 ] Raghu Angadi commented on BEAM-2185: Something like {{withEndTime()}} is required, which sets the upper limit. We need a good way to set starting point. https://github.com/apache/beam/pull/3044 adds {{withStartTime()}. We have all the pieces for someone to work on this. {{withMaxTime()}} : This might be same as {{withEndTime()}}, If it is walltime, it would be problematic with task retries. {{withMaxRecords()}} : This could be max records per partition, we also need {{withStartOffset()}}, also requires roughly uniform partitions. > KafkaIO bounded source > -- > > Key: BEAM-2185 > URL: https://issues.apache.org/jira/browse/BEAM-2185 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Raghu Angadi > > KafkaIO could be a useful source for batch applications as well. It could > implement a bounded source. The primary question is how the bounds are > specified. > One option : Source specifies a time period (say 9am-10am), and KafkaIO > fetches appropriate start and end offsets based on time-index in Kafka. This > would suite many batch applications that are launched on a scheduled. > Another option is to always read till the end and commit the offsets to > Kafka. Handling failures and multiple runs of a task might be complicated. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2185) KafkaIO bounded source
[ https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013319#comment-16013319 ] Jean-Baptiste Onofré commented on BEAM-2185: As a first step, we can imagine to support {{withMaxTime()}} and {{withMaxRecords()}} as we do on most of the others unbounded IOs. WDYT ? > KafkaIO bounded source > -- > > Key: BEAM-2185 > URL: https://issues.apache.org/jira/browse/BEAM-2185 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Raghu Angadi > > KafkaIO could be a useful source for batch applications as well. It could > implement a bounded source. The primary question is how the bounds are > specified. > One option : Source specifies a time period (say 9am-10am), and KafkaIO > fetches appropriate start and end offsets based on time-index in Kafka. This > would suite many batch applications that are launched on a scheduled. > Another option is to always read till the end and commit the offsets to > Kafka. Handling failures and multiple runs of a task might be complicated. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2185) KafkaIO bounded source
[ https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005711#comment-16005711 ] Raghu Angadi commented on BEAM-2185: Some more considerations : - Splits: - Each partition should be split further. - If we could fetch the size in bytes, it is easier to split. I don't think it is feasible get byte offsets. - Split could be based on message offsets, but the size could be too large or too small depending on average message size. - Take a hint from the user if the size is not known? May be not. Read a few sample records? Probably an over kill. - It should support liquid-sharding (dynamic splitting). This does not need average message size. Just the offsets are good enough. > KafkaIO bounded source > -- > > Key: BEAM-2185 > URL: https://issues.apache.org/jira/browse/BEAM-2185 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Raghu Angadi > > KafkaIO could be a useful source for batch applications as well. It could > implement a bounded source. The primary question is how the bounds are > specified. > One option : Source specifies a time period (say 9am-10am), and KafkaIO > fetches appropriate start and end offsets based on time-index in Kafka. This > would suite many batch applications that are launched on a scheduled. > Another option is to always read till the end and commit the offsets to > Kafka. Handling failures and multiple runs of a task might be complicated. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2185) KafkaIO bounded source
[ https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005684#comment-16005684 ] Xu Mingmin commented on BEAM-2185: -- Good idea, I would prefer to option #1, that's the common case I use to ingest Kafka data into a batch process. > KafkaIO bounded source > -- > > Key: BEAM-2185 > URL: https://issues.apache.org/jira/browse/BEAM-2185 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Raghu Angadi > > KafkaIO could be a useful source for batch applications as well. It could > implement a bounded source. The primary question is how the bounds are > specified. > One option : source specifies a time period (say 9am-10am), and KafkaIO fetch > appropriate start and end offsets based on time-index in Kafka. This would > suite many batch applications that are lauched on a scheduled. > Another option is to always read till the end and commit the offsets to > Kafka. Handling failures and multiple runs of a task might be complicated. -- This message was sent by Atlassian JIRA (v6.3.15#6346)