Shaofeng SHI created KYLIN-3679:
-----------------------------------

             Summary: Fetch Kafka topic with Spark streaming
                 Key: KYLIN-3679
                 URL: https://issues.apache.org/jira/browse/KYLIN-3679
             Project: Kylin
          Issue Type: New Feature
          Components: Spark Engine
            Reporter: Shaofeng SHI


Kylin currently uses a MapReduce job to fetch Kafka messages in parallel and
persist them to HDFS for subsequent processing. If the user selects the Spark
engine, we can use the Spark Streaming API to do this instead. Spark Streaming
can read the Kafka messages in a given offset range as an RDD, which makes them
easy to process:

https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
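
As a rough illustration of the "given offset range" idea, here is a minimal
plain-Java sketch that cuts one partition's half-open range
[fromOffset, untilOffset) into bounded slices that could be read in parallel.
The class OffsetRangePlanner, the Slice type, the maxPerSlice parameter and the
topic name are hypothetical names made up for this example; none of them exist
in Kylin or Spark, and this is not a proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Kylin or Spark.
public class OffsetRangePlanner {

    /** Hypothetical stand-in for a half-open offset slice [fromOffset, untilOffset). */
    public static final class Slice {
        public final String topic;
        public final int partition;
        public final long fromOffset;   // inclusive
        public final long untilOffset;  // exclusive

        Slice(String topic, int partition, long fromOffset, long untilOffset) {
            this.topic = topic;
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }

        @Override
        public String toString() {
            return topic + "-" + partition + " [" + fromOffset + ", " + untilOffset + ")";
        }
    }

    /** Splits [fromOffset, untilOffset) into slices of at most maxPerSlice messages. */
    public static List<Slice> split(String topic, int partition,
                                    long fromOffset, long untilOffset, long maxPerSlice) {
        if (maxPerSlice <= 0) {
            throw new IllegalArgumentException("maxPerSlice must be positive");
        }
        if (untilOffset < fromOffset) {
            throw new IllegalArgumentException("untilOffset must be >= fromOffset");
        }
        List<Slice> slices = new ArrayList<>();
        long start = fromOffset;
        while (start < untilOffset) {
            // Size the last slice by the remaining count so it never passes untilOffset.
            long end = start + Math.min(maxPerSlice, untilOffset - start);
            slices.add(new Slice(topic, partition, start, end));
            start = end;
        }
        return slices;
    }

    public static void main(String[] args) {
        // "kylin_events" is an assumed topic name.
        for (Slice s : split("kylin_events", 0, 100L, 350L, 100L)) {
            System.out.println(s);
        }
    }
}
```

With the Kafka 0.10 integration linked above, each slice would correspond to one
OffsetRange.create(topic, partition, fromOffset, untilOffset), and the resulting
array would be passed to KafkaUtils.createRDD together with the Kafka parameters
and a location strategy. Whether Kylin should slice ranges this way is an open
design question.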

With Spark Streaming, Kylin could also connect more easily to other data sources
such as Kinesis and Flume.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
