Hello, Sir! 
What about processing and grouping the data first, then writing the grouped 
data to Kafka topics A and B? Another Spark application could then read topic 
A or B and process it further, much as the term ETL implies.
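
A minimal sketch of that split, assuming Structured Streaming, a JSON field 
named "recordType" that distinguishes the record types, and placeholder topic 
names and broker addresses:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("kafka-fanout").getOrCreate()
import spark.implicits._

// --- Application 1: route each record to a per-type topic ---
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// The Kafka sink routes each row by its "topic" column when no fixed
// "topic" option is set.
val routed = raw.withColumn("topic",
  when(get_json_object($"value", "$.recordType") === "invoice", lit("topicA"))
    .otherwise(lit("topicB")))

routed.selectExpr("topic", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("checkpointLocation", "/tmp/fanout-ckpt")
  .start()

// --- Application 2 (a separate job): consume one typed topic ---
// Each topic now carries a single record type, so a fixed schema can be
// applied with from_json and the result written as parquet.
val typedStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicA")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

The trade-off is an extra round trip through Kafka, but each downstream job 
then deals with exactly one schema.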

     
TianlangStudio
Some of the biggest lies: I will start tomorrow/Others are better than me/I am 
not good enough/I don't have time/This is the way I am
 


------------------------------------------------------------------
From: Amit Joshi <mailtojoshia...@gmail.com>
Sent: Monday, August 10, 2020 02:37
To: user <user@spark.apache.org>
Subject: [Spark-Kafka-Streaming] Verifying the approach for multiple queries

Hi,

I have a scenario where a Kafka topic is being written with different types of 
JSON records.
I have to regroup the records by type, then fetch the corresponding schema, 
parse the records, and write them out as parquet.
I tried Structured Streaming, but the dynamic schema is a constraint, so I 
have used DStreams, although I know the approach I have taken may not be good.
Could anyone please let me know whether this approach will scale, and its 
possible pros and cons?
I am collecting the grouped records on the driver and then forming a DataFrame 
for each group.
createKeyValue -> This creates a key/value pair that carries the schema information.
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { (rdd, time) =>
  // Capture the Kafka offset ranges for this batch up front.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Group the records by their (name, schema) key and collect to the driver.
  val result = rdd.map(createKeyValue).reduceByKey((x, y) => x ++ y).collect()
  result.foreach(x => println(x._1))
  result.foreach { x =>
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._
    // One single-column DataFrame of raw JSON strings per group.
    val df = x._2.toDF("value")
    df.select(from_json($"value", x._1._2, Map.empty[String, String]).as("data"))
      .select($"data.*")
      //.withColumn("entity", lit("invoice"))
      .withColumn("year", year($"TimeUpdated"))
      .withColumn("month", month($"TimeUpdated"))
      .withColumn("day", dayofmonth($"TimeUpdated"))
      .write.partitionBy("name", "year", "month", "day").mode("append").parquet(path)
  }
  // Commit the offsets only after the batch has been fully written.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
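
For reference, createKeyValue is not shown above. A hypothetical sketch of it, 
assuming the key pairs an entity name with its StructType schema and the value 
is a Seq of raw JSON strings (extractTypeName and schemaFor are assumed 
helpers, not part of the original code):

import org.apache.spark.sql.types.StructType
import org.apache.kafka.clients.consumer.ConsumerRecord

// Hypothetical: build ((name, schema), records) so that reduceByKey can
// concatenate the per-type record lists with ++.
def createKeyValue(record: ConsumerRecord[String, String]): ((String, StructType), Seq[String]) = {
  val json = record.value()
  val name = extractTypeName(json)       // assumed helper: reads the type field
  ((name, schemaFor(name)), Seq(json))   // assumed helper: type name -> StructType
}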
