Re: Received an event in channel 0 while still having data from a record

2017-01-11 Thread M. Dale
How were the Parquet files you are trying to read generated? Same version of libraries? I am successfully using the following Scala code to read Parquet files using the HadoopInputFormat wrapper. Maybe try that in Java? val hadoopInputFormat = new HadoopInputFormat[Void, GenericRecord](new
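A minimal sketch of the HadoopInputFormat approach described above, assuming Flink's flink-hadoop-compatibility wrapper and parquet-avro 1.8.x are on the classpath; the input path is illustrative:

    import org.apache.avro.generic.GenericRecord
    import org.apache.flink.api.scala._
    import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.parquet.avro.AvroParquetInputFormat

    val env = ExecutionEnvironment.getExecutionEnvironment
    val job = Job.getInstance

    // Wrap Parquet's Hadoop input format so Flink can read it as a DataSet.
    val hadoopInputFormat = new HadoopInputFormat[Void, GenericRecord](
      new AvroParquetInputFormat[GenericRecord], classOf[Void], classOf[GenericRecord], job)
    FileInputFormat.addInputPath(job, new Path("hdfs:///data/input.parquet"))

    // Records arrive as (Void, GenericRecord) tuples; the key is always null.
    val records = env.createInput(hadoopInputFormat)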

Re: manual scaling with savepoint

2017-01-11 Thread gallenvara
Thanks for your detailed explanation. -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/manual-scaling-with-savepoint-tp10974p10995.html Sent from the Apache Flink User Mailing List archive at Nabble.com.

Flink snapshotting to S3 - Timeout waiting for connection from pool

2017-01-11 Thread Shannon Carey
I'm having pretty frequent issues with the exception below. It basically always ends up killing my cluster after forcing a large number of job restarts. I just can't keep Flink up & running. I am running Flink 1.1.3 on EMR 5.2.0. I already tried updating the emrfs-site config fs.s3.maxConnectio
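The emrfs-site property being referenced here appears to be fs.s3.maxConnections, which on EMR is set through a configuration classification; a hedged sketch of the EMR configuration JSON (the value 200 is just an example):

    [
      {
        "Classification": "emrfs-site",
        "Properties": {
          "fs.s3.maxConnections": "200"
        }
      }
    ]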

Received an event in channel 0 while still having data from a record

2017-01-11 Thread Newport, Billy
Anyone seen this before: Caused by: java.io.IOException: Received an event in channel 0 while still having data from a record. This indicates broken serialization logic. If you are using custom serialization code (Writable or Value types), check their serialization routines. In the case of Kryo

Re: Reading and Writing to S3

2017-01-11 Thread M. Dale
Samra, Looking quickly at your code, I only saw the ExecutionEnvironment for the read and not the StreamExecutionEnvironment for the write. Glad to hear that this worked for batching. Like you, I am very much a Flink beginner who just happened to have tried out the batch write to

Re: Reading and Writing to S3

2017-01-11 Thread Samra Kasim
Hi Markus, Thanks! This was very helpful! I realize what the issue is now. I followed what you did and I am able to write data to S3 if I do batch processing, but not stream processing. Do you know what the difference is and why it would work for one and not the other? Sam On Wed, Jan 11, 2017 a

Re: Custom writer with Rollingsink

2017-01-11 Thread Biswajit Das
Thank you for the reply. I found the issue; my bad, I was trying to write from IntelliJ in local mode to a remote HDFS. If I run in execution mode it works fine now. On Wed, Jan 11, 2017 at 2:13 AM, Fabian Hueske wrote: > Hi, > > the exception says > "org.apache.hadoop.hdfs.protocol.Alrea

Re: Reading and Writing to S3

2017-01-11 Thread M. Dale
Sam, Don't point the variables at files, point them at the directories containing the files. Do you have the fs.s3.impl property defined? Concrete example: the /home/markus/hadoop-config directory has one file, "core-site.xml", with the following content: fs.s3.impl org.apache.hadoop.f
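For completeness, a sketch of such a core-site.xml; the message above is truncated mid-class-name, so the concrete value shown for fs.s3.impl is an assumption:

    <configuration>
      <property>
        <name>fs.s3.impl</name>
        <!-- Assumed value: the original message is cut off here. -->
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
    </configuration>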

Re: Making batches of small messages

2017-01-11 Thread Fabian Hueske
Hi, I think this is a case for the ProcessFunction that was recently added and will be included in Flink 1.2. ProcessFunction lets you register timers (so the 5 secs timeout can be addressed). You can maintain the fault-tolerance guarantees if you collect the records in managed state. That way th
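A minimal sketch of what Fabian describes, assuming Flink 1.2's ProcessFunction applied after a keyBy (timers require a keyed stream); the batch size of 1000 and the 5 s timeout mirror this thread, and the class and state names are illustrative:

    import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.ProcessFunction
    import org.apache.flink.util.Collector
    import scala.collection.JavaConverters._

    class BatchingFunction extends ProcessFunction[String, List[String]] {
      private var buffer: ListState[String] = _  // managed keyed state, so batches survive failures

      override def open(parameters: Configuration): Unit = {
        buffer = getRuntimeContext.getListState(
          new ListStateDescriptor[String]("buffer", classOf[String]))
      }

      override def processElement(value: String,
                                  ctx: ProcessFunction[String, List[String]]#Context,
                                  out: Collector[List[String]]): Unit = {
        buffer.add(value)
        val batch = buffer.get().asScala.toList
        if (batch.size >= 1000) {       // full batch: emit immediately
          out.collect(batch)
          buffer.clear()
        } else if (batch.size == 1) {   // first element: flush after 5 s at the latest
          ctx.timerService().registerProcessingTimeTimer(
            ctx.timerService().currentProcessingTime() + 5000)
        }
      }

      override def onTimer(timestamp: Long,
                           ctx: ProcessFunction[String, List[String]]#OnTimerContext,
                           out: Collector[List[String]]): Unit = {
        val batch = buffer.get().asScala.toList  // may be empty if already flushed by count
        if (batch.nonEmpty) {
          out.collect(batch)
          buffer.clear()
        }
      }
    }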

Making batches of small messages

2017-01-11 Thread Gwenhael Pasquiers
Hi, Sorry if this was already asked. For performance reasons (streaming as well as batch) I'd like to "group" messages (let's say in batches of 1000) before sending them to my sink (Kafka, but mainly ES) so that I have a smaller overhead. I've seen the "countWindow" operation but if I'm not w
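For reference, a sketch of the countWindow variant being discussed; the key and source are placeholders. Note that a pure count window only fires when it is full, which is why the reply above suggests ProcessFunction when a timeout is also needed:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
    import org.apache.flink.util.Collector

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val messages: DataStream[String] = env.socketTextStream("localhost", 9999) // placeholder source

    val batches: DataStream[Seq[String]] = messages
      .keyBy(_.hashCode % 10)  // illustrative key, only to spread load across subtasks
      .countWindow(1000)       // fires once 1000 elements have arrived for a key
      .apply { (key: Int, w: GlobalWindow, elements: Iterable[String], out: Collector[Seq[String]]) =>
        out.collect(elements.toSeq)  // forward the whole batch to the sink as one element
      }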

Re: Reading and Writing to S3

2017-01-11 Thread Samra Kasim
Hi Markus, Thanks for your help. I created an environment variable in IntelliJ for FLINK_CONF_DIR to point to the flink-conf.yaml and in it defined fs.hdfs.hadoopconf to point to the core-site.xml, but when I do that, I get the error: java.io.IOException: No file system found with scheme s3, refer
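For reference, a sketch of the intended wiring, using the illustrative paths from this thread: FLINK_CONF_DIR should name the directory holding flink-conf.yaml, and fs.hdfs.hadoopconf the directory holding core-site.xml (not the files themselves):

    # flink-conf.yaml, inside the directory that FLINK_CONF_DIR points to
    fs.hdfs.hadoopconf: /home/markus/hadoop-config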

Reading compressed XML data

2017-01-11 Thread Sebastian Neef
Hi, what's the best way to read a compressed (bz2 / gz) XML file, splitting it at a specific XML tag? So far I've been using Hadoop's TextInputFormat in combination with Mahout's XmlInputFormat ([0]) via env.readHadoopFile(). Whereas the plain TextInputFormat can handle compressed data, the XmlInp
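For context, a sketch of the setup being described, assuming Mahout's mapreduce-API XmlInputFormat on the classpath (the package and the tag names are illustrative); whether it splits compressed input correctly is exactly the open question here:

    import org.apache.flink.api.scala._
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.mahout.text.wikipedia.XmlInputFormat

    val env = ExecutionEnvironment.getExecutionEnvironment
    val job = Job.getInstance

    // XmlInputFormat emits everything between the configured start and end tags as one record.
    job.getConfiguration.set(XmlInputFormat.START_TAG_KEY, "<page>")
    job.getConfiguration.set(XmlInputFormat.END_TAG_KEY, "</page>")

    val xmlRecords = env.readHadoopFile(
      new XmlInputFormat, classOf[LongWritable], classOf[Text],
      "hdfs:///data/dump.xml.bz2", job)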

RE: Avro Parquet/Flink/Beam

2017-01-11 Thread Newport, Billy
Did you manage to push yet? Thanks -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Tuesday, December 13, 2016 11:12 AM To: user@flink.apache.org Subject: Re: Avro Parquet/Flink/Beam Hi Billy, no, ParquetIO is at an early stage and won't be included in 0.4.0

Re: Kafka KeyedStream source

2017-01-11 Thread Niels Basjes
Hi, Ok. I think I get it. WHAT IF: Assume we create an addKeyedSource(...) which will allow us to add a source that makes some guarantees about the data. And assume this source returns simply the Kafka partition id as the result of this 'hash' function. Then if I have 10 Kafka partitions I would r
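A sketch of the idea, assuming a hypothetical record type that carries the Kafka partition id:

    import org.apache.flink.streaming.api.scala._

    // Hypothetical record type: assumes the deserialization schema attaches
    // the Kafka partition id to every element it produces.
    case class Event(kafkaPartition: Int, payload: String)

    // Keying by the partition id reproduces Kafka's partitioning inside Flink,
    // but still pays for the network shuffle that a true keyed source would avoid.
    def keyedByPartition(events: DataStream[Event]): KeyedStream[Event, Int] =
      events.keyBy(_.kafkaPartition)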

Fault tolerance guarantees of Elasticsearch sink in flink-elasticsearch2?

2017-01-11 Thread Andrew Roberts
Hello, I’m trying to understand the guarantees made by Flink’s Elasticsearch sink in terms of message delivery. According to (1), the ES sink offers at-least-once guarantees. This page doesn’t differentiate between flink-elasticsearch and flink-elasticsearch2, so I have to assume for the moment

Re: FLINK-5236. flink 1.1.3 and modifying the classpath to use another version of fasterxml.

2017-01-11 Thread Till Rohrmann
Hi Estela, it's correct that Flink's runtime has a dependency on com.fasterxml.jackson.core:jackson-core:jar:2.7.4:compile com.fasterxml.jackson.core:jackson-databind:jar:2.7.4:compile com.fasterxml.jackson.core:jackson-annotations:jar:2.7.4:compile First of all it's important to understand wher
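Till's message is truncated above; independent of what he goes on to recommend, a common workaround for this kind of conflict is to relocate the application's own Jackson with the maven-shade-plugin, sketched here (the shaded package name is arbitrary):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- Move the application's Jackson into a private namespace so it
                   cannot clash with the Jackson 2.7.4 on Flink's classpath. -->
              <relocation>
                <pattern>com.fasterxml.jackson</pattern>
                <shadedPattern>my.shaded.com.fasterxml.jackson</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>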

Re: Increasing parallelism skews/increases overall job processing time linearly

2017-01-11 Thread Till Rohrmann
Hi CVP, changing the parallelism from 1 to 2 with every TM having only one slot will inevitably introduce another network shuffle operation between the sources and the keyed CoFlatMap. This might be the source of your slowdown, because before everything was running on one machine without any ne

Re: Sliding Event Time Window Processing: Window Function inconsistent behavior

2017-01-11 Thread Sujit Sakre
Hi Aljoscha, I have realized that the output stream is not defined separately in the code below, and hence the input values were getting into the sink. After defining a separate output stream it works. We have now confirmed that the windows are processed separately as per the groupings. Thanks.

Re: About delta awareness caches

2017-01-11 Thread Aljoscha Krettek
Hi, (I'm just getting back from holidays, hence the slow response. Sorry for that.) I think you can simulate the way Storm windows work by using a GlobalWindows assigner and having a custom Trigger and/or Evictor and also some special logic in your WindowFunction. About mergeable state, we're
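A sketch of the GlobalWindows pattern Aljoscha mentions, with the stock CountTrigger and CountEvictor standing in for the custom trigger/evictor logic:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
    import org.apache.flink.streaming.api.windowing.evictors.CountEvictor
    import org.apache.flink.streaming.api.windowing.triggers.CountTrigger

    // GlobalWindows never fires on its own, so the trigger and evictor carry
    // all the logic; the count-based ones below are placeholders.
    def stormStyle(events: DataStream[(String, Int)]): DataStream[(String, Int)] =
      events
        .keyBy(_._1)
        .window(GlobalWindows.create())
        .trigger(CountTrigger.of(100))  // fire every 100 elements per key
        .evictor(CountEvictor.of(100))  // keep only the most recent 100 elements
        .sum(1)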

Re: How to get help on ClassCastException when re-submitting a job

2017-01-11 Thread Fabian Hueske
Hi Guiliano, thanks for bringing up this issue. A "ClassCastException: X cannot be cast to X" often points to a classloader issue, so it might actually be a bug in Flink. I assume you submit the same application (same jar file) with the same command, right? Did you cancel the job before resubmitti

Re: Custom writer with Rollingsink

2017-01-11 Thread Fabian Hueske
Hi, the exception says "org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /y=2017/m=01/d=10/H=18/M=12/_part-0-0.in-progress for DFSClient_NONMAPREDUCE_1062142735_3". I would assume that your output format tries to create a file that already exists. Maybe you need
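For context, the basic RollingSink wiring from flink-connector-filesystem (Flink 1.1), with the stock StringWriter standing in for the custom writer and an illustrative base path:

    import org.apache.flink.streaming.connectors.fs.{RollingSink, StringWriter}

    val sink = new RollingSink[String]("hdfs:///data/out")  // illustrative base path
      .setWriter(new StringWriter[String]())                // replace with the custom writer
      .setBatchSize(1024 * 1024 * 128L)                     // roll part files at 128 MB

    // someStream.addSink(sink)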

Re: manual scaling with savepoint

2017-01-11 Thread Fabian Hueske
Hi, Flink supports two types of state: 1) Key-partitioned state 2) Non-partitioned operator state (Checkpointed interface) Key-partitioned state is internally organized by key and can be "simply" rehashed. The actual implementation is more involved to make this efficient. This document contains d
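A sketch of key-partitioned state, the rescalable kind Fabian describes, applied after a keyBy; the names are illustrative and the descriptor uses the Flink 1.1-era constructor with a default value:

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector

    // Counts events per key; because the count lives in key-partitioned state,
    // Flink can redistribute it across new parallel instances when rescaling.
    class PerKeyCounter extends RichFlatMapFunction[(String, Long), (String, Long)] {
      private var count: ValueState[Long] = _

      override def open(parameters: Configuration): Unit = {
        count = getRuntimeContext.getState(
          new ValueStateDescriptor[Long]("count", classOf[Long], 0L))
      }

      override def flatMap(value: (String, Long), out: Collector[(String, Long)]): Unit = {
        val updated = count.value() + 1  // state is scoped to the current key only
        count.update(updated)
        out.collect((value._1, updated))
      }
    }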