I have tried this sort of approach in other streaming cases I ran into, and
I believe the problems with this approach are:

1) Say we have one stream (stream1) going to disk, e.g. HDFS or a database,
and another stream (stream2) where, for every row in stream2, we make an I/O
call to check whether it joins with one or more rows in stream1. That is far
too many I/O calls if we make one for every row.
2) Alternatively, we could make one I/O call per RDD partition in stream2,
but then we risk running into full table scan issues as the data from
stream1 grows.
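
For context on option 2, the per-partition variant is usually written with
mapPartitions so that only one connection is opened per partition. This is a
rough sketch only: openConnection, lookupBatch, and the key field are
placeholders for whatever store holds stream1's rows, not a real API.

```scala
// Sketch only: one lookup per RDD partition instead of one per row.
// openConnection() and lookupBatch() stand in for whatever store
// (HDFS index, database, key-value store) holds the rows of stream1.
stream2.foreachRDD { rdd =>
  rdd.mapPartitions { rows =>
    val conn = openConnection()                       // one I/O channel per partition
    try {
      val batch = rows.toArray
      val hits  = conn.lookupBatch(batch.map(_.key))  // one bulk call per partition
      batch.flatMap(r => hits.get(r.key).map(m => (r, m))).iterator
    } finally conn.close()
  }
}
```

Whether this avoids the full-table-scan problem depends entirely on the
store behind lookupBatch being indexed by the join key.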

So I wonder: has anyone been able to implement this approach in production
successfully (by which I mean making sure it is not resource intensive)?

Thanks!

On Sat, Jul 14, 2018 at 9:18 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> No, the streaming dataframe needs to be written to disk or similar (or an
> in-memory backend); then, when the next stream arrives, join them - create
> the graph and store the new stream together with the existing stream on
> disk, etc.
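
The pattern Jörn describes can be sketched with Structured Streaming's
foreachBatch (available from Spark 2.4); the path, schema, join key, and the
`spark` session in scope are all assumptions for illustration, not a tested
implementation.

```scala
import org.apache.spark.sql.DataFrame

// Each micro-batch of the incoming stream is joined against the data already
// persisted on disk; the batch itself is then appended to the store so that
// later micro-batches can see it. Path and column names are illustrative.
stream2.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  val existing = spark.read.parquet("/data/stream1_store")
  batch.join(existing, Seq("id"))
       .write.mode("append").parquet("/data/joined")
  batch.write.mode("append").parquet("/data/stream1_store")
}.start()
```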
>
> On 14. Jul 2018, at 17:19, kant kodali <kanth...@gmail.com> wrote:
>
> The question now would be: can it be done in a streaming fashion? Are you
> talking about taking the union of two streaming dataframes and then
> constructing a graphframe (also during streaming)?
>
> On Sat, Jul 14, 2018 at 8:07 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> For your use case one might indeed be able to work simply with
>> incremental graph updates. However, they are not straightforward in Spark.
>> You can union the new data with the existing dataframes that represent your
>> graph and create a new graph frame from that.
>>
>> However I am not sure if this will fully fulfill your requirement for
>> incremental graph updates.
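
The union-and-rebuild idea above might be sketched like this, where `graph`,
`newVertices`, and `newEdges` are assumed to be in scope (the existing
GraphFrame and the newly arrived vertex/edge dataframes); this is
illustrative only.

```scala
import org.graphframes.GraphFrame

// Merge newly arrived vertices and edges into the existing graph by
// unioning the dataframes and constructing a fresh GraphFrame.
// `graph`, `newVertices`, and `newEdges` are assumed to be in scope.
val vertices = graph.vertices.union(newVertices).dropDuplicates("id")
val edges    = graph.edges.union(newEdges).dropDuplicates("src", "dst")
val updated  = GraphFrame(vertices, edges)
```

Note that this rebuilds the GraphFrame from scratch on every update, which
is why it may not fully satisfy a requirement for truly incremental updates.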
>>
>> On 14. Jul 2018, at 15:59, kant kodali <kanth...@gmail.com> wrote:
>>
>> "You want to update incrementally an existing graph and run
>> incrementally a graph algorithm suitable for this - you have to
>> implement yourself as far as I am aware"
>>
>> I want to update the graph incrementally and run some graph queries
>> similar to Cypher, e.g. give me all the vertices that are connected by a
>> specific set of edges, and so on. I don't really intend to run graph
>> algorithms like ConnectedComponents or anything else at this point, but of
>> course it would be great to have.
>>
>> If I were to do this myself, should I extend GraphFrame? Any
>> suggestions?
>>
>>
>> On Sun, Apr 29, 2018 at 3:24 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> What is the use case you are trying to solve?
>>> You want to load graph data from a streaming window into separate graphs -
>>> possible, but it probably requires a lot of memory.
>>> You want to update an existing graph with new streaming data and then
>>> fully rerun an algorithm -> look at JanusGraph
>>> You want to update incrementally an existing graph and run incrementally
>>> a graph algorithm suitable for this - you have to implement yourself as far
>>> as I am aware
>>>
>>> > On 29. Apr 2018, at 11:43, kant kodali <kanth...@gmail.com> wrote:
>>> >
>>> > Do GraphFrames support streaming?
>>>
>>
>>
>
