Re: Beam with Flink runner - Issues when writing to S3 in Parquet Format

2021-09-14 Thread David Morávek
Hi Sandeep, Jan has already provided pretty good guidelines for getting more context on the issue ;) Because this is not the first time, I would like to raise awareness that it's not OK to send a user-related question to four Apache mailing lists (that I know of). Namely: - u...@flink.apache

Re: [DISCUSS] Drop support for Flink 1.10

2021-05-31 Thread David Morávek
Hi, +1, as we've agreed in the past to keep support for the three latest major releases. D. On Mon, May 31, 2021 at 9:54 AM Jan Lukavský wrote: > Hi, > > +1 to remove the support for 1.10. > > Jan > On 5/28/21 10:00 PM, Ismaël Mejía wrote: > > Hello, > > With Beam support for Flink 1.13 just merged

Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-12 Thread David Morávek
+1 D. On Thu, Mar 11, 2021 at 8:33 PM Ismaël Mejía wrote: > +user > > > Should we add a warning or something to 2.29.0? > > Sounds like a good idea. > > > > > On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles wrote: > > > > Should we add a warning or something to 2.29.0? > > > > On Thu, Mar 11,

Re: Beam flink runner job not keeping up with input rate after downscaling

2020-08-16 Thread David Morávek
Hello Sandeep, Are you seeing any skew in your data (i.e., are the affected TMs receiving more data than others)? How many partitions does your source topic have (this could explain why some TMs would have more work to perform)? Also, would it be possible to retry your test with the latest SDK? D. On Sun
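
One quick way to check for the key skew mentioned above is to count elements per key over a fixed window and inspect the distribution. A minimal sketch, assuming a keyed PCollection<KV<String, String>> named `events` read from the Kafka topic (the name and types are hypothetical, not from the thread):

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Count elements per key in 5-minute windows; a few keys dominating the
// counts would explain why some TMs receive more data than others.
PCollection<KV<String, Long>> perKeyCounts =
    events
        .apply(Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey());
```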

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread David Morávek
Hi Stephen, nice catch and awesome report! ;) This definitely needs a proper fix. I've created a new JIRA to track the issue and will try to resolve it soon, as this seems critical to me. https://issues.apache.org/jira/browse/BEAM-9794 Thanks, D. On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel w
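
For context, @RequiresStableInput is the Beam annotation at play in this thread: on the Flink runner it buffers elements until a checkpoint completes, so non-idempotent side effects are not replayed with different input after a restore. A minimal sketch (the DoFn and its side effect are hypothetical):

```java
import org.apache.beam.sdk.transforms.DoFn;

// Sketch of a DoFn that opts into stable input: the runner delays
// processElement until the input element has been durably checkpointed.
class StableSideEffectFn extends DoFn<String, String> {
  @ProcessElement
  @RequiresStableInput
  public void processElement(@Element String element, OutputReceiver<String> out) {
    // Placeholder for a non-idempotent call (RPC, external write, ...).
    System.out.println("side effect for: " + element);
    out.output(element);
  }
}
```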

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread David Morávek
Hi, Spark's GBK is currently implemented using `sortBy(key and value).mapPartition(...)` for non-merging windowing, in order to support large keys and large-scale shuffles. Merging windowing is implemented using a standard GBK (the underlying Spark impl. uses ListCombiner + Hash Grouping), which is by de
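
For reference, the pattern this thread is about looks like the following. A minimal sketch, assuming an input PCollection<KV<String, Integer>> named `input` (hypothetical); each extra pass over the grouped Iterable is what can trigger another read of the sorted partition on the Spark runner:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<String, Iterable<Integer>>> grouped =
    input.apply(GroupByKey.<String, Integer>create());

grouped.apply(ParDo.of(
    new DoFn<KV<String, Iterable<Integer>>, KV<String, Double>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Iterable<Integer> values = c.element().getValue();
        long count = 0;
        for (int v : values) { count++; }   // first iteration
        double sum = 0;
        for (int v : values) { sum += v; }  // second iteration re-reads the group
        c.output(KV.of(c.element().getKey(), sum / count));
      }
    }));
```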

Re: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: No filesystem found for scheme s3

2019-08-10 Thread David Morávek
Hi, you are most likely missing the S3 filesystem dependency (beam-sdks-java-io-amazon-web-services). Best, D. Sent from my iPhone > On 10 Aug 2019, at 18:27, jitendra sharma wrote: > > Hi, > > I am getting below error reading files from S3. Could you please help me what > could be the pro
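
A sketch of the fix: add the org.apache.beam:beam-sdks-java-io-amazon-web-services artifact (matching your Beam version) to the classpath, after which s3:// paths resolve through the registered filesystem. The bucket and path below are hypothetical:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// With the AWS filesystem module on the classpath, the "s3" scheme is
// picked up by Beam's FileSystems registry at pipeline construction time.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("s3://my-bucket/input/*.txt"));  // hypothetical path
```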

Re: [Python] Read Hadoop Sequence File?

2019-07-02 Thread David Morávek
Hi, you can use SequenceFileSink and SequenceFileSource from the BigTable client; they work nicely with FileIO. https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java https://

Re: [ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-19 Thread David Morávek
Great job Kyle, thanks for pushing this forward! Sent from my iPhone > On 18 Jun 2019, at 12:58, Ismaël Mejía wrote: > > I have been thrilled from seeing from the first row this happening. > > Thanks a lot Kyle. Excellent work! > > > On Mon, Jun 17, 2019 at 9:15 PM Ankur Goenka wrote: > > >

Re: Is there a way to decide what RDDs get cached in the Spark Runner?

2019-05-16 Thread David Morávek
Hello Augusto, This is a long-standing problem that is really hard to fix properly unless the runner has a better understanding of what is actually happening in the pipeline itself (that's what Jan already pointed out). I still think the best approach, one that would require the fewest "knobs" for the user to t

Re: kafka 0.9 support

2019-04-03 Thread David Morávek
t 11:27 PM Austin Bennett wrote: > I withdraw my concern -- checked on info on the cluster I will eventually > access. It is on 0.8, so I was speaking too soon. Can't speak to rest of > user base. > > On Tue, Apr 2, 2019 at 11:03 AM Raghu Angadi wrote: > >> Thanks to D

Re: joda-time dependency version

2019-03-23 Thread David Morávek
ch is in "-MM-dd HH:mm:ss" >> format, to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12 >> 19-Mar-15 PDT". >> > > >> > > The javadoc for both the versions doesn't seem different though, for >> 'z
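
For readers of the archive, the conversion under discussion can be reproduced with plain Joda-Time. A sketch, assuming the truncated input pattern above is "yyyy-MM-dd HH:mm:ss" and a Pacific time zone (both assumptions, since the snippet is cut off):

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

DateTimeFormatter in = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(DateTimeZone.forID("America/Los_Angeles"));
DateTimeFormatter out = DateTimeFormat.forPattern("hh:mm:ss yy-MMM-dd z");

DateTime dt = in.parseDateTime("2019-03-15 01:56:12");
// Prints "01:56:12 19-Mar-15 PDT"; how 'z' is rendered is exactly what
// this thread suspects may differ between joda-time versions.
System.out.println(out.print(dt));
```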

Re: joda-time dependency version

2019-03-21 Thread David Morávek
Hello Rahul, are there any incompatibilities you are running into with the Spark version? These versions should be backward compatible. From the Joda-Time docs: The main public API will remain *backwards compatible* for both source and binary in the 2.x stream. This means you should be able to safely use Sp

Re: Moving to spark 2.4

2018-12-07 Thread David Morávek
+1 for waiting for HDP and CDH adoption Sent from my iPhone > On 7 Dec 2018, at 16:38, Alexey Romanenko wrote: > > I agree with Ismael and I’d wait until the new Spark version will be > supported by major BigData distributors. > >> On 7 Dec 2018, at 14:57, Vishwas Bm wrote: >> >> Hi Ismael,