I have tried this sort of approach in other streaming cases I ran into, and I believe the problems with this approach are:
1) We have one stream (say stream1) going to disk, say HDFS or a database, and another stream (say stream2) where, for every row in stream2, we make an I/O call to see if we can join it with a row or rows in stream1. That is too many I/O calls if we issue one per row.
2) We could instead make one I/O call per RDD partition in stream2, but then we risk full table scan issues as the data from stream1 grows.

So I wonder if anyone has been able to implement this approach in production successfully (by which I mean making sure it is not resource intensive)?

Thanks!

On Sat, Jul 14, 2018 at 9:18 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> No, the streaming dataframe needs to be written to disk or similar (or an
> in-memory backend); then, when the next stream arrives, join them: create
> the graph and store the next stream together with the existing stream on
> disk, etc.
>
> On 14. Jul 2018, at 17:19, kant kodali <kanth...@gmail.com> wrote:
>
> The question now would be: can it be done in streaming fashion? Are you
> talking about the union of two streaming dataframes and then constructing
> a graphframe (also during streaming)?
>
> On Sat, Jul 14, 2018 at 8:07 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> For your use case one might indeed be able to work simply with
>> incremental graph updates. However, they are not straightforward in
>> Spark. You can union the new data with the existing dataframes that
>> represent your graph and create a new graph frame from that.
>>
>> However, I am not sure if this will fully fulfill your requirement for
>> incremental graph updates.
>>
>> On 14. Jul 2018, at 15:59, kant kodali <kanth...@gmail.com> wrote:
>>
>> "You want to update incrementally an existing graph and run
>> incrementally a graph algorithm suitable for this - you have to
>> implement yourself as far as I am aware"
>>
>> I want to update the graph incrementally and run some graph queries
>> similar to Cypher, like "give me all the vertices that are connected by
>> a specific set of edges" and so on. I don't really intend to run graph
>> algorithms like ConnectedComponents or anything else at this point, but
>> of course it would be great to have.
>>
>> If I were to do this myself, should I extend GraphFrame? Any
>> suggestions?
>>
>> On Sun, Apr 29, 2018 at 3:24 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> What is the use case you are trying to solve?
>>> You want to load graph data from a streaming window into separate
>>> graphs: possible, but it probably requires a lot of memory.
>>> You want to update an existing graph with new streaming data and then
>>> fully rerun an algorithm -> look at JanusGraph.
>>> You want to update incrementally an existing graph and run
>>> incrementally a graph algorithm suitable for this -> you have to
>>> implement it yourself, as far as I am aware.
>>>
>>> On 29. Apr 2018, at 11:43, kant kodali <kanth...@gmail.com> wrote:
>>>
>>> Do GraphFrames support streaming?
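The "one I/O call per RDD partition" idea from point 2 can be illustrated without Spark at all. Below is a minimal plain-Python sketch: a dict stands in for the external store (HDFS table or database) holding stream1, lists of rows stand in for stream2's partitions, and `batch_lookup` / `join_partition` are hypothetical names, not any Spark or database API. The point is only the access pattern: collect the keys of a whole partition first, then make one batched call instead of one call per row.

```python
def batch_lookup(store, keys):
    """One simulated I/O round trip: fetch all requested keys at once."""
    return {k: store[k] for k in keys if k in store}

def join_partition(store, partition):
    """Join one stream2 partition against stream1 with a single batched call."""
    keys = {key for key, _ in partition}   # gather keys for the whole partition
    matches = batch_lookup(store, keys)    # one call, not one per row
    return [(key, value, matches[key])
            for key, value in partition if key in matches]

if __name__ == "__main__":
    stream1_store = {"a": 1, "b": 2, "c": 3}            # stream1 already persisted
    stream2_partitions = [[("a", "x"), ("d", "y")],     # two stream2 partitions
                          [("b", "z")]]
    joined = [row for part in stream2_partitions
              for row in join_partition(stream1_store, part)]
    print(joined)
```

In Spark this pattern usually lives inside `mapPartitions`, which is exactly what makes the number of I/O calls proportional to the number of partitions rather than the number of rows; it does not, by itself, address the full-table-scan concern, which depends on how the store indexes the keys.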
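Jörn's suggestion of unioning each new micro-batch into the existing vertex/edge dataframes and rebuilding the graph frame can also be sketched without Spark. In this toy version, plain sets stand in for the vertex and edge DataFrames, set union stands in for the DataFrame union-plus-dedup, and `connected_by` is a hypothetical stand-in for a Cypher-like query ("give me all vertices connected by a specific set of edges"); none of these names are GraphFrame API.

```python
def union_graph(vertices, edges, new_vertices, new_edges):
    """Union a new micro-batch into the existing graph (set union deduplicates)."""
    return vertices | new_vertices, edges | new_edges

def connected_by(edges, labels):
    """Return every vertex touched by an edge whose label is in `labels`."""
    out = set()
    for src, dst, label in edges:
        if label in labels:
            out |= {src, dst}
    return out

if __name__ == "__main__":
    vertices, edges = {"a", "b"}, {("a", "b", "follows")}
    # a new micro-batch arrives; union it into the existing graph
    vertices, edges = union_graph(vertices, edges,
                                  {"b", "c"}, {("b", "c", "likes")})
    print(sorted(connected_by(edges, {"likes"})))  # ['b', 'c']
```

As Jörn notes in the thread, this is a full rebuild on every batch, not a true incremental update: the union grows without bound, which is the same resource concern Kant raises about stream1 on disk.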