Re: Flink+avro integration

2015-10-22 Thread Till Rohrmann
In the Java API, we only support the `max` operation for tuple types where you reference the fields via indices. Cheers, Till On Thu, Oct 22, 2015 at 4:04 PM, aawhitaker wrote: > Stephan Ewen wrote > > This is actually not a bug, or a POJO or Avro problem. It is simply a > > limitation in the functionality …
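As a hedged illustration of what Till describes (not code from the thread): in the Java DataSet API, grouping and aggregations on tuple types address fields by position, so a grouped max looks like this.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class TupleMaxExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> scores = env.fromElements(
                new Tuple2<>("a", 1), new Tuple2<>("a", 3), new Tuple2<>("b", 2));
        // Fields are addressed by index: group on field 0, take the max of field 1.
        scores.groupBy(0).max(1).print();
    }
}
```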

reading csv file from null value

2015-10-22 Thread Philip Lee
Hi, I am trying to load a dataset that contains null values using readCsvFile(). // e.g _date|_click|_sales|_item|_web_page|_user case class WebClick(_click_date: Long, _click_time: Long, _sales: Int, _item: Int, _page: Int, _user: Int) private def getWebClickDataSet(env: ExecutionEnvironment) …
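A sketch of one common workaround for empty (null) numeric columns, assuming a recent DataSet API: since primitive fields cannot hold null, the nullable columns are read as String first and a sentinel is substituted before parsing. This is Java rather than the Scala case class above; the file path and three-column schema are illustrative only.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class CsvWithNulls {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Read nullable columns as String: a primitive field cannot represent an empty value.
        DataSet<Tuple3<String, String, String>> raw = env
                .readCsvFile("/path/to/web_clicks.dat")  // hypothetical path
                .fieldDelimiter("|")
                .types(String.class, String.class, String.class);
        // Replace empty fields with a sentinel (-1) before parsing to numbers.
        DataSet<Tuple3<Long, Integer, Integer>> clicks = raw.map(
                new MapFunction<Tuple3<String, String, String>, Tuple3<Long, Integer, Integer>>() {
                    @Override
                    public Tuple3<Long, Integer, Integer> map(Tuple3<String, String, String> t) {
                        return new Tuple3<>(
                                t.f0.isEmpty() ? -1L : Long.parseLong(t.f0),
                                t.f1.isEmpty() ? -1 : Integer.parseInt(t.f1),
                                t.f2.isEmpty() ? -1 : Integer.parseInt(t.f2));
                    }
                });
        clicks.first(10).print();
    }
}
```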

Re: Zeppelin Integration

2015-10-22 Thread Till Rohrmann
Hi Trevor, that’s actually my bad since I only tested my branch against a remote cluster. I fixed the problem (the LocalFlinkMiniCluster was not being started properly), so you can now use Zeppelin in local mode as well. Just check out my branch again. Cheers, Till On Wed, Oct 21, 2015 at 10:00 PM, Tr…

Re: Flink+avro integration

2015-10-22 Thread aawhitaker
Stephan Ewen wrote > This is actually not a bug, or a POJO or Avro problem. It is simply a > limitation in the functionality, as the exception message says: > "Specifying > fields by name is only supported on Case Classes (for now)." > > Try this with a regular reduce function that selects the max …
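A minimal sketch of Stephan’s suggestion, with a hypothetical POJO standing in for the Avro type from the thread: group by a field expression and use a plain reduce that keeps the record with the larger value.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class MaxByReduce {
    // Hypothetical POJO standing in for the Avro/POJO type from the thread.
    public static class Event {
        public String id;
        public double value;
        public Event() {}
        public Event(String id, double value) { this.id = id; this.value = value; }
        @Override public String toString() { return id + "=" + value; }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Event> events = env.fromElements(
                new Event("a", 1.0), new Event("a", 5.0), new Event("b", 2.0));
        // Group by a POJO field expression, then keep the element with the larger value.
        events.groupBy("id")
              .reduce((a, b) -> a.value >= b.value ? a : b)
              .print();
    }
}
```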

RE: Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hi, Thanks a lot for the explanation. I cannot even say that it wasn’t stated in the documentation; I simply missed the iterator part: “by default, user defined functions (like map() or reduce()) are getting new objects on each call (or through an iterator). So it is possible to keep references …

Running continuously on yarn with kerberos

2015-10-22 Thread Niels Basjes
Hi, I want to write a long-running (i.e., never stopped) streaming Flink application on a Kerberos-secured Hadoop/YARN cluster. My application needs to access files on HDFS and HBase tables on that cluster, so having the correct Kerberos tickets is very important. The stream is to be ingested …

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Till Rohrmann
You don’t modify the objects; the ReusingKeyGroupedIterator, which is the iterator you get in your reduce function, does. Internally it uses two objects, in your case of type Tuple2, to deserialize the input records. These two objects are returned alternately when you call next on the iterator …
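A hedged sketch of the consequence for user code (the types and fields are illustrative): if object reuse is enabled and you want to hold on to records beyond one iterator step, copy them first.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// With object reuse on, the group iterator alternates between two reused
// objects, so anything kept across next() calls must be copied.
public class CopyBeforeKeeping
        implements GroupReduceFunction<Tuple2<Long, String>, Tuple2<Long, Integer>> {
    @Override
    public void reduce(Iterable<Tuple2<Long, String>> values,
                       Collector<Tuple2<Long, Integer>> out) {
        List<Tuple2<Long, String>> kept = new ArrayList<>();
        for (Tuple2<Long, String> v : values) {
            kept.add(v.copy()); // Tuple2.copy() creates a new tuple with the same fields
        }
        // Emit the group key and the number of records seen in the group.
        out.collect(new Tuple2<>(kept.get(0).f0, kept.size()));
    }
}
```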

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Stephan Ewen
With object reuse activated, Flink heavily reuses objects. Each call to the Iterator in the reduceGroup function gives back one of the same two objects, which has been filled with different contents. Your list of all values will effectively only contain two distinct objects. Furthermore, the loop …

RE: Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hi, I was using primitive types, and enableObjectReuse was turned on. My next move was to turn it off, and that did solve the problem. It also increased execution time by 10%, but it’s hard to say whether this overhead is due to the copying or to the changed behavior of the reduceGroup algorithm once …
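For reference, a minimal sketch of toggling that switch on the ExecutionConfig (the two methods exist in the DataSet API; the surrounding class is illustrative):

```java
import org.apache.flink.api.java.ExecutionEnvironment;

public class ObjectReuseToggle {
    public static void main(String[] args) {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().enableObjectReuse();   // faster, but records handed to UDFs may be reused
        env.getConfig().disableObjectReuse();  // the default: safe to keep references to records
    }
}
```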

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Till Rohrmann
If not, could you provide us with the program and test data to reproduce the error? Cheers, Till On Thu, Oct 22, 2015 at 12:34 PM, Aljoscha Krettek wrote: > Hi, > but he’s comparing it to a primitive long, so shouldn’t the Long key be > unboxed and the comparison still be valid? > > My question is whether you enabled object-reuse mode …

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Aljoscha Krettek
Hi, but he’s comparing it to a primitive long, so shouldn’t the Long key be unboxed and the comparison still be valid? My question is whether you enabled object-reuse mode on the ExecutionEnvironment? Cheers, Aljoscha > On 22 Oct 2015, at 12:31, Stephan Ewen wrote: > > Hi! > > You are checking …

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Stephan Ewen
Hi! You are checking for equality / inequality with "!=" - can you check with "equals()"? The key objects will most certainly be different in each record (as they are deserialized individually), but they should be equal. Stephan On Thu, Oct 22, 2015 at 12:20 PM, LINZ, Arnaud wrote: > Hello, …
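A self-contained illustration of both points in this thread (Stephan’s equals() advice and Aljoscha’s unboxing remark), independent of Flink:

```java
public class BoxedLongEquality {
    public static void main(String[] args) {
        Long a = 1000L;
        Long b = 1000L;
        System.out.println(a == b);      // false: reference comparison (1000 is outside the Long cache)
        System.out.println(a.equals(b)); // true: value comparison
        long p = 1000L;
        System.out.println(a == p);      // true: a is unboxed to a primitive long
    }
}
```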

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
In principle, a data set that branches only needs to be materialized if both branches are pipelined up to the point where they are merged (i.e., in a hybrid-hash join). Otherwise, the data flow might deadlock due to pipelining. If you group both data sets before they are joined, the pipeline is broken due to the b…

Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hello, Trying to understand why my code was giving strange results, I ended up adding “useless” checks to my code and found what looks to me like a bug. I group my dataset by a key, but in the reduceGroup function I am passed values with different keys. My code has the following …

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Pieter Hameete
Thanks for your responses! The derived datasets would indeed be grouped after the filter operations. Why would this cause them to be materialized to disk? And if I understand correctly, the data source will not chain to more than one filter, causing (de)serialization to transfer the records from …

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
It might even be materialized (to disk) if both derived data sets are joined. 2015-10-22 12:01 GMT+02:00 Till Rohrmann : > I fear that the filter operations are not chained because there are at > least two of them which have the same DataSet as input. However, it's true > that the intermediate results are not materialized …

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Till Rohrmann
I fear that the filter operations are not chained because there are at least two of them that have the same DataSet as input. However, it's true that the intermediate results are not materialized. It is also correct that the filter operators are deployed co-located with the data sources. Thus, there …

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Gábor Gévay
Hello! > I have thought about a workaround where the InputFormat would return > Tuple2s and the first field is the name of the dataset to which a record > belongs. This would however require me to filter the read data once for > each dataset, or to do a groupReduce, which is some overhead I'm > looking …
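A sketch of the tagged-record workaround under discussion, with hypothetical tags and sink paths: the source emits (dataset-name, record) pairs once, and each logical dataset is carved out with its own filter inside a single job.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class SplitByTag {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Stand-in for an InputFormat that tags each record with its path name.
        DataSet<Tuple2<String, String>> tagged = env.fromElements(
                new Tuple2<>("pathA", "recordA1"),
                new Tuple2<>("pathB", "recordB1"),
                new Tuple2<>("pathA", "recordA2"));
        // One filter per logical dataset; both sinks run in one execution.
        tagged.filter(t -> "pathA".equals(t.f0)).writeAsText("/tmp/pathA"); // hypothetical paths
        tagged.filter(t -> "pathB".equals(t.f0)).writeAsText("/tmp/pathB");
        env.execute("split tagged input");
    }
}
```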

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Till Rohrmann
Hi Pieter, at the moment there is no support for partitioning a `DataSet` into multiple subsets in one pass over it. If you really want distinct data sets for each path, then you have to filter, afaik. Cheers, Till On Thu, Oct 22, 2015 at 11:38 AM, Pieter Hameete wrote: > Good morning! …

Reading multiple datasets with one read operation

2015-10-22 Thread Pieter Hameete
Good morning! I have the following use case: my program reads nested data (in this specific case XML) based on projections (path expressions) of this data. Often multiple paths are projected onto the same input. I would like each path to result in its own dataset. Is it possible to generate more …

Re: Question regarding parallelism

2015-10-22 Thread Stephan Ewen
Hi! The bottom of this page also has an illustration of how tasks map to task slots: https://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/config.html There are two optimizations involved: (1) Chaining: here sources, mappers, and filters are chained together. This is pretty classic; most systems …