zhaP524 created SPARK-21948:
-------------------------------

             Summary: How to use Spark Streaming to process two tables of 
data from one Kafka topic?
                 Key: SPARK-21948
                 URL: https://issues.apache.org/jira/browse/SPARK-21948
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Spark Submit, Structured Streaming
    Affects Versions: 2.1.1
         Environment: Kafka 0.10.0
Spark 2.1.1
            Reporter: zhaP524


I have the following requirement: from one Kafka topic, consumed by one 
consumer group, I receive a DirectStream that contains data for two tables 
(A <Master>, B <Slave>). My ultimate goal is to join the data of these two 
tables on certain fields and write the results into a database.

My attempts so far are as follows:
1. Do batch processing directly at the DirectStream layer: filter out the A 
and B records, build a DataFrame for each, perform the join with Spark SQL, 
and write the results to the database (see the first sketch after this 
list). But this approach has a problem: A and B have a 1:N relationship, so 
within a given batch all of the A data may be loaded while only part of the 
B data is, and the computed result is only partial. By the time the 
remaining B data arrives in the next batch, the matching A data has already 
been consumed, so the later B rows have nothing to join against and data is 
lost. I wonder whether the A data could be buffered (in a queue or similar) 
until all of the corresponding B data has been loaded and processed, so 
that the complete result is generated; in essence that is similar to a 
Spark Streaming window operation.
2. Use a Spark Streaming window operation: I can separate the A and B 
records from the DirectStream, but the join has to be carried out inside 
the window, and the window operation does not tolerate transformations 
such as map, so the job cannot proceed and fails with:

java.io.NotSerializableException: 
org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
        - object not serializable (class: 
org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = 
doc, partition = 0,...)

I do not know what I should do here (a possible workaround is shown in the 
second sketch below).
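
For reference, here is a minimal sketch of attempt 1, under assumptions not 
stated in the ticket: the topic is "doc" (taken from the stack trace 
above), messages carry a hypothetical "A|" / "B|" prefix that tells the two 
tables apart, and the row fields, join key, broker address, and group id 
are placeholders rather than the actual schema:

{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder row types for the two tables; field names are assumptions.
case class RowA(joinKey: String, payloadA: String)
case class RowB(joinKey: String, payloadB: String)

object TwoTableJoin {
  // Hypothetical encoding: each message is "A|key|payload" or "B|key|payload".
  def parseA(s: String): RowA = { val p = s.split("\\|"); RowA(p(1), p(2)) }
  def parseB(s: String): RowB = { val p = s.split("\\|"); RowB(p(1), p(2)) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TwoTableJoin").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",      // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "two-table-group",     // placeholder
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("doc"), kafkaParams))

    stream.foreachRDD { rdd =>
      import spark.implicits._
      // Pull the String payload out immediately; ConsumerRecord is not serializable.
      val values = rdd.map(_.value())
      val dfA = values.filter(_.startsWith("A|")).map(parseA).toDF()
      val dfB = values.filter(_.startsWith("B|")).map(parseB).toDF()
      // Per-batch join: B rows whose matching A arrived in an earlier batch are lost.
      val joined = dfA.join(dfB, "joinKey")
      joined.show() // replace with the JDBC write into the database
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}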
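On the NotSerializableException in attempt 2: ConsumerRecord is not 
serializable, and a window forces Spark to persist and shuffle the 
stream's elements, so a windowed DStream that still holds ConsumerRecord 
objects fails. A common workaround (a sketch only, continuing from the 
stream and helpers above, with placeholder window durations) is to map 
each record down to its plain serializable value before applying the 
window:

{code:java}
// Convert each ConsumerRecord to a plain String payload first, so the
// window never has to serialize ConsumerRecord itself.
val values = stream.map(_.value())

// Keep 60 seconds of data, sliding every 10 seconds (placeholder durations),
// so A rows stay available for B rows that arrive in later batches.
val windowed = values.window(Seconds(60), Seconds(10))

windowed.foreachRDD { rdd =>
  import spark.implicits._
  val dfA = rdd.filter(_.startsWith("A|")).map(parseA).toDF()
  val dfB = rdd.filter(_.startsWith("B|")).map(parseB).toDF()
  // Overlapping windows re-emit the same joined rows, so the database
  // write should be idempotent (e.g. an upsert keyed on joinKey).
  val joined = dfA.join(dfB, "joinKey")
  joined.show() // replace with the JDBC write
}
{code}

Note that a window only bounds the original problem rather than solving it: 
a B row that arrives more than one window length after its A row is still 
dropped, so the window duration has to be chosen from how far apart A and B 
can arrive.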

So I am wondering: is there a problem with my scenario, or is it a 
technical problem on my side? Is there a solution for my business scenario 
that does not introduce other components? Any directions would be 
appreciated. Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
