Spark Streaming Topology
Hi all,

I would like to use Spark Streaming for the problem below. I have 2 InputStreams: one for one type of input (n-dimensional vectors) and one for questions to the infrastructure (explained below).

I need to first "break" the input across 4 execution nodes and produce a stream from each (4 streams), each containing only the m vectors that share common attributes (what counts as "common" comes from comparing incoming vectors with the vectors already inside the "selected" group). If the selected group of m vectors changes through the addition of a new vector, I need to forward the selected group to an "aggregation node" that joins the selected groups of 2 execution nodes (we have 2 such aggregation nodes). Following that, if an aggregation node sees a change in its joined selected group of vectors, it forwards its joined selected vectors to another aggregation node that holds the overall aggregation of the 2 aggregation nodes (JoinAll). Finally, when a "question" arrives, a mapToPair and then a sort transformation need to be applied to JoinAll and the result printed (one result per question).

Can anyone help me with this endeavour? I think (correct me if I am mistaken) the architecture described needs to:

1. Partition Input Stream (1) through DStream eachPartition = inputVectors.repartition(4)
2. Filter eachPartition, like DStream eachPartitionFiltered = eachPartition.filter(func) -> How can I use data already persisted to the node for that? [I mean the already "selected group" of that specific partition]
3. Group 2 out of 4 partitions into another DStream if needed and store it inside another node. [???]
4. If a "question" arrives, use the JoinAll DStream to answer the question. [???]

Thank you in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Topology-tp24686.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
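For step 2, one way to use per-partition state is to key each vector by its partition and keep the "selected group" in Spark's managed state (updateStateByKey, or mapWithState in Spark 1.6+), rather than trying to read node-local persisted data from inside filter(). Below is a minimal, hedged sketch of just the admission rule, written as plain Java so the logic is clear; the class name, the toy "same first attribute" criterion, and the capacity m are all placeholders for whatever the real similarity test is. Inside Spark, this update function would be the body passed to updateStateByKey, and its boolean result is the "group changed, forward downstream" signal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-partition "selected group" update step.
// In Spark Streaming this logic would live inside updateStateByKey /
// mapWithState; shown here as a plain function to isolate the rule.
public class SelectedGroup {

    // Placeholder criterion: a vector "shares common attributes" with the
    // group if its first attribute matches the group's. The real comparison
    // is whatever similarity test the application defines.
    static boolean sharesAttributes(int[] v, List<int[]> group) {
        if (group.isEmpty()) return true;   // an empty group admits anything
        return v[0] == group.get(0)[0];
    }

    // Tries to admit `incoming` into `group` (capacity m). Returns true if
    // the group changed -- the signal to forward it to the aggregation node.
    static boolean update(List<int[]> group, int[] incoming, int m) {
        if (group.size() < m && sharesAttributes(incoming, group)) {
            group.add(incoming);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> group = new ArrayList<>();
        update(group, new int[]{1, 5}, 2);              // admitted: group empty
        update(group, new int[]{1, 7}, 2);              // admitted: shares attribute 1
        boolean rejected = update(group, new int[]{2, 9}, 2); // rejected: full + differs
        System.out.println(group.size() + " changed=" + !rejected);
    }
}
```

With this shape, step 3 (joining two nodes' groups) becomes a join/union of two such state streams keyed by aggregation-node id, and step 4 queries the JoinAll state when a question arrives on the second InputStream.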
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org