Yes, that sounds like a great idea, and that's actually what I am trying to do.
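To make sure I understand the suggestion, here is a minimal, language-agnostic sketch of the pattern I am after: one socket per generator node, one reader per socket, and a union of the results. The hostnames and ports are placeholders, and in Flink itself this would correspond to one env.socketTextStream(host, port) per generator followed by DataStream.union; the plain-socket version below just illustrates the shape.

```python
import socket
import threading
import queue

def serve_lines(lines, host="127.0.0.1"):
    """Tiny stand-in for one generator node: serves newline-delimited
    edges over a socket, then closes. Returns the port it listens on."""
    srv = socket.socket()
    srv.bind((host, 0))          # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, _ = srv.accept()
        conn.sendall(("\n".join(lines) + "\n").encode())
        conn.close()
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def socket_source(host, port, out):
    """One reader per generator socket -- the moral equivalent of
    env.socketTextStream(host, port) in the DataStream API."""
    def run():
        with socket.create_connection((host, port)) as sock:
            buf = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                buf += chunk
        for line in buf.decode().splitlines():
            out.put(line)
        out.put(None)  # end-of-stream marker for this source

    t = threading.Thread(target=run)
    t.start()
    return t

def union_sources(ports, host="127.0.0.1"):
    """Merge all per-socket sources into one stream, like DataStream.union."""
    out = queue.Queue()
    threads = [socket_source(host, p, out) for p in ports]
    edges, finished = [], 0
    while finished < len(threads):
        item = out.get()
        if item is None:
            finished += 1
        else:
            edges.append(item)
    for t in threads:
        t.join()
    return edges
```

In the real job, each per-host source would run with parallelism 1 on (ideally) the same machine as its generator, and the union feeds the downstream analysis unchanged.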
> Then you configure your analysis job to read from each of these sockets with
> a separate source and union them before feeding them to the actual job?

Before trying to open the sockets on the slave nodes, I first opened just one socket on the master node and ran the generator on a single node as well. I was able to read the graph and run my algorithm without any problems. This was a test run to see whether I could do it.

After that, I opened a bunch of sockets on my generators, and now I am trying to configure Flink to read from those sockets. However, I am having problems assigning each task manager to a separate socket. I assume my problems are related to network binding. In my configuration file, jobmanager.rpc.address is set, but I have not done a similar configuration for the slave nodes.

Am I on the right track, or is there an easier way to handle this? I think my point is how to do the `read from each of these sockets with a separate source` part.

Thanks again
Best
Kaan

> On Apr 24, 2020, at 3:11 AM, Arvid Heise <ar...@ververica.com> wrote:
>
> Hi Kaan,
>
> sorry, I haven't considered I/O as the bottleneck. I thought a bit more about
> your issue and came to a rather simple solution.
>
> How about you open a socket on each of your generator nodes? Then you
> configure your analysis job to read from each of these sockets with a
> separate source and union them before feeding them to the actual job?
>
> You don't need to modify much on the analysis job and each source can be
> independently read. WDYT?
>
> On Fri, Apr 24, 2020 at 8:46 AM Kaan Sancak <kaans...@gmail.com
> <mailto:kaans...@gmail.com>> wrote:
> Thanks for the answer! Also thanks for raising some concerns about my
> question.
>
> Some of the graphs I have been using are larger than 1.5 TB. I am currently
> at the experiment stage of a project, making modifications to my code and
> re-running the experiments.
> Currently, on some of the largest graphs I have been using, I/O has become
> an issue for me and keeps me waiting for a couple of hours.
>
> Moreover, I have a parallel/distributed graph generator, which I can run on
> the same set of nodes in my cluster. So what I wanted to do was to run my
> Flink program and the graph generator at the same time and feed the graph
> through the generator, which should be faster than doing I/O from disk. As
> you said, it is not essential for me to do that, but I am trying to see what
> I am able to do using Flink and how I can solve such problems. I was also
> using another framework and faced a similar problem there; I was able to
> reduce the graph read time from hours to minutes using this method.
>
>> Do you really have more main memory than disk space?
>
> My issue is actually not storage related; I am trying to see how I can
> reduce the I/O time.
>
> One trick that came to my mind is creating a dummy dataset and, using a map
> function on that dataset, opening up a bunch of sockets, listening to the
> generator, and collecting the generated data. I am trying to see how it will
> turn out.
>
>> Alternatively, if graph generation is rather cheap, you could also try to
>> incorporate it directly into the analysis job.
>
> I am not familiar with the analysis jobs. I will look into it.
>
> Again, this is actually not a problem; I am just trying to experiment with
> the framework and see what I can do. I am very new to Flink, so my methods
> might be wrong. Thanks for the help!
>
> Best
> Kaan
>
>
>> On Apr 23, 2020, at 10:51 AM, Arvid Heise <ar...@ververica.com
>> <mailto:ar...@ververica.com>> wrote:
>>
>> Hi Kaan,
>>
>> afaik there is no (easy) way to switch from streaming back to batch API
>> while retaining all data in memory (correct me if I misunderstood).
>>
>> However, from your description, I also have some severe understanding
>> problems. Why can't you dump the data to some file? Do you really have more
>> main memory than disk space?
>> Or do you have no shared memory between your generating cluster and the
>> Flink cluster?
>>
>> It almost sounds as if the issue at heart is rather to find a good
>> serialization format on how to store the edges. The 70 billion edges could
>> be stored in an array of id pairs, which amount to ~560 GB uncompressed
>> data if stored in Avro (or any other binary serialization format) when ids
>> are longs. That's not much by today's standards and could also be easily
>> offloaded to S3.
>>
>> Alternatively, if graph generation is rather cheap, you could also try to
>> incorporate it directly into the analysis job.
>>
>> On Wed, Apr 22, 2020 at 2:58 AM Kaan Sancak <kaans...@gmail.com
>> <mailto:kaans...@gmail.com>> wrote:
>> Hi,
>>
>> I have been running some experiments on large graph data; the smallest
>> graph I have been using has ~70 billion edges. I have a graph generator,
>> which generates the graph in parallel and feeds it to the running system.
>> However, it takes a lot of time to read the edges, because even though the
>> graph generation process is parallel, in Flink I can only listen from the
>> master node (correct me if I am wrong). Another option is dumping the
>> generated data to a file and reading it with readCsvFile; however, this is
>> not feasible in terms of storage management.
>>
>> What I want to do is invoke my graph generator using IPC/TCP protocols and
>> read the generated data from the sockets. Since the graph data is also
>> generated in parallel on each node, I want to make use of IPC and read the
>> data in parallel at each node. I did some online digging but couldn't find
>> anything similar using the DataSet API. I would be glad if you have some
>> similar use cases or examples.
>>
>> Is it possible to use the streaming environment to create the data in
>> parallel and switch to the DataSet API?
>>
>> Thanks in advance!
>>
>> Best
>> Kaan
>>
>>
>> --
>> Arvid Heise | Senior Java Developer
>> <https://www.ververica.com/>
>> Follow us @VervericaData
>> --
>> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
>> Stream Processing | Event Driven | Real Time
>> --
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>> --
>> Ververica GmbH
>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
>> Ji (Toni) Cheng