Hi, we have a use case where a single record needs to be included in two different aggregations.
For example, take a credit card transaction "A" that belongs to both the ATM and the cross-border transaction categories. To count ATM transactions, I need to include transaction A. To count cross-border transactions, I need to include transaction A as well. Because these aggregations have to run in parallel, we decided to go with data explosion, so that transaction A can be aggregated twice.

Questions:

1. Is data explosion the only way to address this?
2. The data is skewed, so the job runs out of executor memory when we try to aggregate. Repartitioning after the data explosion to address the skew is too expensive for us. What other ways are there to address this problem?

Note: A transaction is marked as an ATM transaction or a cross-border transaction by a boolean value.
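
To make the use case concrete, here is a minimal sketch in plain Python (not Spark code). The field names `is_atm` and `is_cross_border`, the sample records, and both helper functions are hypothetical and only for illustration. It contrasts the explosion approach with counting each boolean flag in one pass over the unexploded records. Whether the second shape fits the real aggregations is part of what is being asked.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
transactions = [
    {"id": "A", "is_atm": True, "is_cross_border": True},
    {"id": "B", "is_atm": True, "is_cross_border": False},
    {"id": "C", "is_atm": False, "is_cross_border": True},
    {"id": "D", "is_atm": False, "is_cross_border": False},
]

# One boolean flag per transaction category.
FLAGS = ("is_atm", "is_cross_border")


def count_by_explosion(records):
    """Emit one (category, id) row per flag that is set, then count per category."""
    exploded = [
        (flag, rec["id"])
        for rec in records
        for flag in FLAGS
        if rec[flag]
    ]
    # Transaction A appears twice in `exploded`, once per category.
    return dict(Counter(flag for flag, _ in exploded))


def count_in_single_pass(records):
    """Keep one row per transaction and update one counter per flag."""
    counts = {flag: 0 for flag in FLAGS}
    for rec in records:
        for flag in FLAGS:
            if rec[flag]:
                counts[flag] += 1
    return counts


explosion_counts = count_by_explosion(transactions)
single_pass_counts = count_in_single_pass(transactions)
```

In Spark terms, the second function would roughly correspond to one conditional aggregate per flag (for example, summing a `when(...)` expression) over the original rows. That mapping is an assumption about the workload, not something tested on it.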