Hi Kaan,

sorry, I hadn't considered I/O as the bottleneck. I thought a bit more
about your issue and came up with a rather simple solution.

How about you open a socket on each of your generator nodes? Then you
could configure your analysis job to read from each of these sockets with
a separate source and union them before feeding them to the actual job.
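
For illustration, here is a minimal sketch with the DataStream API. The
host names, the port, and the tab-separated edge format are assumptions,
not something your setup prescribes:

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketUnionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder host names and port; one socket per generator node.
        List<String> generatorHosts = Arrays.asList("node1", "node2", "node3");
        int port = 9999;

        // One source per generator socket, unioned into a single stream.
        DataStream<String> lines =
                env.socketTextStream(generatorHosts.get(0), port);
        for (String host : generatorHosts.subList(1, generatorHosts.size())) {
            lines = lines.union(env.socketTextStream(host, port));
        }

        // Assumes the generator emits one "src<TAB>dst" edge per line.
        DataStream<Tuple2<Long, Long>> edges = lines
                .map(line -> {
                    String[] f = line.split("\t");
                    return Tuple2.of(Long.parseLong(f[0]), Long.parseLong(f[1]));
                })
                .returns(Types.TUPLE(Types.LONG, Types.LONG));

        edges.print(); // stand-in for the actual analysis logic

        env.execute("read-edges-from-generator-sockets");
    }
}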

You wouldn't need to modify much in the analysis job, and each source can
be read independently. WDYT?

On Fri, Apr 24, 2020 at 8:46 AM Kaan Sancak <kaans...@gmail.com> wrote:

> Thanks for the answer! Also thanks for raising some concerns about my
> question.
>
> Some of the graphs I have been using are larger than 1.5 TB. I am
> currently at the experimentation stage of a project, where I keep
> modifying my code and re-running the experiments. On some of the largest
> graphs, I/O has become an issue for me and keeps me waiting for a couple
> of hours.
>
> Moreover, I have a parallel/distributed graph generator, which I can run
> on the same set of nodes in my cluster. So what I wanted to do was run my
> Flink program and the graph generator at the same time and feed the graph
> in through the generator, which should be faster than reading it from
> disk. As you said, it is not essential for me to do that, but I am trying
> to see what I can do with Flink and how I can solve such problems. I used
> another framework before and faced a similar problem there; with this
> method I was able to reduce the graph read time from hours to minutes.
>
>> Do you really have more main memory than disk space?
>
>
> My issue is actually not storage related; I am trying to see how I can
> reduce the I/O time.
>
> One trick that came to my mind is creating a dummy dataset and, using a
> map function on it, opening up a bunch of sockets that listen to the
> generator and collect the generated data. I am trying to see how it will
> turn out.
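>
> Roughly what I have in mind (untested; the host list and port are made
> up, and I know Flink does not guarantee which task ends up reading from
> which host):
>
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.Socket;
>
> import org.apache.flink.api.common.typeinfo.Types;
> import org.apache.flink.api.java.DataSet;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.util.Collector;
>
> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
> // Dummy dataset: one element per generator socket (placeholder hosts).
> DataSet<String> hosts = env.fromElements("node1", "node2", "node3");
>
> DataSet<Tuple2<Long, Long>> edges = hosts
>         .flatMap((String host, Collector<Tuple2<Long, Long>> out) -> {
>             // Connect to the generator and collect everything it sends;
>             // assumes one "src<TAB>dst" edge per line.
>             try (Socket socket = new Socket(host, 9999);
>                  BufferedReader in = new BufferedReader(
>                          new InputStreamReader(socket.getInputStream()))) {
>                 String line;
>                 while ((line = in.readLine()) != null) {
>                     String[] f = line.split("\t");
>                     out.collect(Tuple2.of(Long.parseLong(f[0]),
>                             Long.parseLong(f[1])));
>                 }
>             }
>         })
>         .returns(Types.TUPLE(Types.LONG, Types.LONG));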
>
>> Alternatively, if graph generation is rather cheap, you could also try
>> to incorporate it directly into the analysis job.
>
>
> I am not familiar with the analysis jobs. I will look into it.
>
> Again, this is actually not a problem; I am just trying to experiment
> with the framework and see what I can do. I am very new to Flink, so my
> methods might be wrong. Thanks for the help!
>
> Best
> Kaan
>
>
> On Apr 23, 2020, at 10:51 AM, Arvid Heise <ar...@ververica.com> wrote:
>
> Hi Kaan,
>
> AFAIK, there is no (easy) way to switch from the streaming API back to
> the batch API while retaining all data in memory (correct me if I
> misunderstood).
>
> However, there are a few things in your description that I don't
> understand. Why can't you dump the data to some file? Do you really have
> more main memory than disk space? Or do you have no shared storage
> between your generating cluster and the Flink cluster?
>
> It almost sounds as if the issue at heart is rather finding a good
> serialization format for storing the edges. The 70 billion edges could be
> stored as an array of id pairs, which amounts to ~1.1 TB of uncompressed
> data (70 × 10^9 edges × 2 longs × 8 bytes) when ids are longs and stored
> in Avro (or any other binary serialization format). That's not much by
> today's standards and could also easily be offloaded to S3.
>
> Alternatively, if graph generation is rather cheap, you could also try to
> incorporate it directly into the analysis job.
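>
> For example, if each edge can be derived deterministically from an index,
> a sketch like this would generate the edges in parallel inside the job
> (the map below is a placeholder for your real generator logic):
>
> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
> long numEdges = 70_000_000_000L;   // from your description
> long numVertices = 1_000_000_000L; // placeholder
>
> // generateSequence is split across all parallel instances, so the edges
> // are produced in parallel without any disk or socket I/O.
> DataSet<Tuple2<Long, Long>> edges = env
>         .generateSequence(0, numEdges - 1)
>         .map(i -> Tuple2.of(i % numVertices, (i * 31 + 7) % numVertices))
>         .returns(Types.TUPLE(Types.LONG, Types.LONG));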
>
> On Wed, Apr 22, 2020 at 2:58 AM Kaan Sancak <kaans...@gmail.com> wrote:
>
>> Hi,
>>
>> I have been running some experiments on large graph data; the smallest
>> graph I have been using has around 70 billion edges. I have a graph
>> generator, which generates the graph in parallel and feeds it to the
>> running system. However, it takes a lot of time to read the edges,
>> because even though the graph generation process is parallel, in Flink I
>> can only listen from the master node (correct me if I am wrong). Another
>> option is dumping the generated data to a file and reading it with
>> readCsvFile; however, this is not feasible in terms of storage
>> management.
>>
>> What I want to do is invoke my graph generator using IPC/TCP protocols
>> and read the generated data from the sockets. Since the graph data is
>> also generated in parallel on each node, I want to make use of IPC and
>> read the data in parallel at each node. I did some online digging but
>> couldn't find anything similar using the DataSet API. I would be glad if
>> you have some similar use cases or examples.
>>
>> Is it possible to use the streaming environment to create the data in
>> parallel and then switch to the DataSet API?
>>
>> Thanks in advance!
>>
>> Best
>> Kaan

-- 
Arvid Heise | Senior Java Developer
<https://www.ververica.com/>

Follow us @VervericaData
--
Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
Ji (Toni) Cheng
