Re: Executing graph algorithms on Gelly that are larger than memory

2016-11-28 Thread otherwise777
Small addition: I'm currently running the programs via my IDE, IntelliJ.

Executing graph algorithms on Gelly that are larger than memory

2016-11-28 Thread otherwise777
I read somewhere that Flink and Gelly should be able to handle graph algorithms that require more space than the available memory. I'm currently getting a java.lang.OutOfMemoryError (Java heap space); if Flink were using disk space, that wouldn't happen. Currently my algorithms use dense graphs with 10m edges, the
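
Running from an IDE means a local environment with default settings. A hedged sketch of giving the local runtime an explicit managed-memory budget (factory method and config key as in the Flink 1.x batch API; the value is illustrative), so batch operators can spill to disk rather than exhaust the heap; raising the run configuration's -Xmx is the other obvious knob:

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.ConfigConstants;
    import org.apache.flink.configuration.Configuration;

    public class LocalGellyJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // reserve a fixed managed-memory budget (in MB) for sorting/hashing,
            // which the batch runtime can spill to disk when it runs out
            conf.setLong(ConfigConstants.TASK_MANAGER_MEMORY_SIZE_KEY, 512);
            ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);
            // ... build and run the Gelly program on env ...
        }
    }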

Re: Split csv file in equally sized slices (less than or equal)

2016-11-28 Thread Flavio Pompermaier
Great, thanks! On 28 Nov 2016 8:54 p.m., "Fabian Hueske" wrote: > Hi Flavio, > > sure. > This code should be close to what you need: > > public static class BatchingMapper implements MapPartitionFunction<String, String[]> { > > int cnt = 0; > String[] batch = new String[1000]; > > @Override >

Re: Split csv file in equally sized slices (less than or equal)

2016-11-28 Thread Fabian Hueske
Hi Flavio, sure. This code should be close to what you need: public static class BatchingMapper implements MapPartitionFunction<String, String[]> { int cnt = 0; String[] batch = new String[1000]; @Override public void mapPartition(Iterable<String> values, Collector<String[]> out) throws Exception { for(String v
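
The archive preview cuts the snippet off mid-loop. A plausible completion under the visible declarations (batch size 1000; emitting a trimmed copy for the final partial batch is my assumption, not necessarily the original):

    import java.util.Arrays;
    import org.apache.flink.api.common.functions.MapPartitionFunction;
    import org.apache.flink.util.Collector;

    public static class BatchingMapper implements MapPartitionFunction<String, String[]> {

        int cnt = 0;
        String[] batch = new String[1000];

        @Override
        public void mapPartition(Iterable<String> values, Collector<String[]> out) throws Exception {
            for (String v : values) {
                batch[cnt++] = v;
                if (cnt == batch.length) {
                    out.collect(batch);       // emit a full batch of 1000 ids
                    batch = new String[1000]; // fresh array, the old one was handed downstream
                    cnt = 0;
                }
            }
            if (cnt > 0) {
                out.collect(Arrays.copyOf(batch, cnt)); // last batch may hold fewer than 1000 ids
            }
        }
    }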

Re: Split csv file in equally sized slices (less than or equal)

2016-11-28 Thread Flavio Pompermaier
Thanks for the support Fabian! I think I'll try the tumbling window method, it seems cleaner. Btw, just for the sake of completeness, can you show me a brief snippet (also in pseudocode) of a mapPartition that groups together elements into chunks of size n? Best, Flavio On Mon, Nov 28, 2016 at 8:

Re: Split csv file in equally sized slices (less than or equal)

2016-11-28 Thread Fabian Hueske
Hi Flavio, I think the easiest solution is to read the CSV file with the CsvInputFormat and use a subsequent MapPartition to batch 1000 rows together. In each partition, you might end up with an incomplete batch. However, I don't see yet how you can feed these batches into the JdbcInputFormat whic

Re: Problems with RollingSink

2016-11-28 Thread Kostas Kloudas
Hi Diego, The message shows that two tasks are trying to touch the same file concurrently. Is this message thrown upon recovery after a failure, or at the initialization of the job? Could you please check the logs for other exceptions before this one? Could it be related to this issue? https://www.

Split csv file in equally sized slices (less than or equal)

2016-11-28 Thread Flavio Pompermaier
Hi to all, I have a use case where I have to read a huge CSV containing ids to fetch from a table in a db. The jdbc input format can handle parameterized queries, so I was thinking to fetch data using 1000 ids at a time. What is the easiest way to divide a dataset into slices of 1000 ids each (in ord

Problems with RollingSink

2016-11-28 Thread Diego Fustes Villadóniga
Hi colleagues, I am experiencing problems when trying to write events from a stream to HDFS. I get the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/biguardian/events/2016-11-28--15/flinkpar
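
If the clash comes from two jobs (or two attempts of the same job) writing part files into one base path, one hedged workaround is a per-job part-file prefix. A sketch against the flink-connector-filesystem API of that era; the random-UUID prefix scheme is an assumption, not a documented fix:

    import java.util.UUID;
    import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
    import org.apache.flink.streaming.connectors.fs.RollingSink;

    RollingSink<String> sink = new RollingSink<String>("hdfs:///user/biguardian/events")
        .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH")) // matches the 2016-11-28--15 bucket in the stack trace
        .setPartPrefix("part-" + UUID.randomUUID());         // unique prefix per job avoids file-name collisions
    stream.addSink(sink); // stream: a DataStream<String> (placeholder)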

Re: How to let 1.1.3 not drop late events as version 1.0.3 does

2016-11-28 Thread vinay patil
Hi Sendoh, I have used a custom trigger which is the same as the 1.0.3 EventTimeTrigger, and set the allowedLateness value to Long.MAX_VALUE. With this change, late elements are not discarded and become single-element windows. Regards, Vinay Patil On Mon, Nov 28, 2016 at 5:54 PM, Sendoh [via
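
A hedged sketch of such a trigger, modeled on the 1.0.3 EventTimeTrigger behavior of firing immediately for late elements (method names as in the 1.1 trigger API; treat this as a sketch, not Vinay's exact code):

    import org.apache.flink.streaming.api.windowing.triggers.Trigger;
    import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

    public class FireOnLateEventTimeTrigger extends Trigger<Object, TimeWindow> {

        @Override
        public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
            if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
                // late element: fire right away, as the 1.0.3 EventTimeTrigger did
                return TriggerResult.FIRE;
            }
            ctx.registerEventTimeTimer(window.maxTimestamp());
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
            return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
            ctx.deleteEventTimeTimer(window.maxTimestamp());
        }
    }

Combined with allowedLateness(Time.milliseconds(Long.MAX_VALUE)) on the windowed stream, the window state is kept and each late element produces a new firing.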

How to let 1.1.3 not drop late events as version 1.0.3 does

2016-11-28 Thread Sendoh
Hi Flink users, Can I ask how to avoid the default allowedLateness(0), so that late events become single-element windows, as version 1.0.3 behaves? Best, Sendoh

Re: Regarding time window based on the values received in the stream

2016-11-28 Thread Fabian Hueske
Hi, sorry for the late reply. There is a repository [1] with an example application that uses a custom trigger [2] (though together with a TimeWindow and not with a GlobalWindow). I'm not aware of a repo with an example of a GlobalWindow. Regarding the question about timestamps and watermarks: I

Re: JobManager shows TaskManager was lost/killed while TaskManager process is still running and the network is OK.

2016-11-28 Thread Renkai
The ZooKeeper-related logs are logged by user code. I finally found the reason why the TaskManager was lost: I gave the TaskManager a large amount of memory, and the JobManager decided the TaskManager was down while it was in a Full GC. Thanks for your help.
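
For reference, a flink-conf.yaml sketch of the timeouts involved (key names as documented for Flink 1.x; the values are purely illustrative, not recommendations). Raising the heartbeat pause gives a GC-stalled TaskManager more time before the JobManager declares it lost:

    # flink-conf.yaml (illustrative values)
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 300 s
    akka.ask.timeout: 60 s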

Re: Problem - Negative currentWatermark if the watermark assignment is made before connecting the streams

2016-11-28 Thread Fabian Hueske
Hi Pedro, if I read your code correctly, you are not assigning timestamps and watermarks to the rules stream. Flink automatically derives watermarks from all streams involved. If you do not assign a watermark, the default watermark is Long.MIN_VALUE, which is exactly the value you are observing.
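
A minimal sketch of assigning a watermark to the rules stream before connect(), assuming the rules arrive as a DataStream<String> and carry no event time of their own, so this stream should never hold back the combined watermark:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    public static DataStream<String> withMaxWatermark(DataStream<String> rules) {
        return rules.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<String>() {
            @Override
            public long extractTimestamp(String rule, long previousElementTimestamp) {
                return Long.MAX_VALUE; // assumption: rules have no event time of their own
            }
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(Long.MAX_VALUE); // never pins the combined watermark at the minimum
            }
        });
    }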

Re: DB connection and query inside map function

2016-11-28 Thread Fabian Hueske
Hi Anastasios, that's certainly possible. The most straight-forward approach would be a synchronous call to the database. Because only one request is active at the same time, you do not need a thread pool. You can establish the connection in the open() method of a RichMapFunction. The problem with
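
The preview breaks off at the caveat, but the pattern described (open the connection once per parallel task in open(), reuse a prepared statement in map(), clean up in close()) looks roughly like this sketch; the JDBC URL, table, and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class DbLookupMapper extends RichMapFunction<Integer, String> {

        private transient Connection conn;
        private transient PreparedStatement stmt;

        @Override
        public void open(Configuration parameters) throws Exception {
            // one connection per parallel task instance
            conn = DriverManager.getConnection("jdbc:...", "user", "pass"); // placeholder URL
            stmt = conn.prepareStatement("SELECT name FROM t WHERE id = ?");
        }

        @Override
        public String map(Integer id) throws Exception {
            // synchronous lookup: only one request in flight per task
            stmt.setInt(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }

        @Override
        public void close() throws Exception {
            if (stmt != null) stmt.close();
            if (conn != null) conn.close();
        }
    }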

Re: multiple k-means in parallel

2016-11-28 Thread Fabian Hueske
Hi Lydia, that is certainly possible; however, you need to adapt the algorithm a bit. The straight-forward approach would be to replicate the input data and assign IDs for each k-means run. If you have a data point (1, 2, 3) you could replicate it to three data points (10, 1, 2, 3), (15, 1, 2, 3),
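
A minimal sketch of the replication step described above, assuming the points are double[] arrays and using an explicit list of run ids (both representation choices are assumptions):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    // given: DataSet<double[]> points
    final int[] runIds = {10, 15, 20}; // one id per k-means run, mirroring the (10, ...), (15, ...) example
    DataSet<Tuple2<Integer, double[]>> replicated = points
        .flatMap(new FlatMapFunction<double[], Tuple2<Integer, double[]>>() {
            @Override
            public void flatMap(double[] p, Collector<Tuple2<Integer, double[]>> out) {
                for (int id : runIds) {
                    out.collect(new Tuple2<>(id, p)); // same point, once per run
                }
            }
        });
    // downstream k-means steps then group by field 0 so the runs stay independent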