It's important to note that running multiple streaming queries would, as of today, read the input data that many times. So there is a trade-off between the two approaches: even though scenario 1 won't get great Catalyst optimization, it may be more efficient overall in terms of resource usage.
There may be a hybrid solution possible. You could craft multiple rules using the SQL DSL: for N rules, you add N boolean columns, each with its value set based on one rule expressed through SQL functions. Finally, the foreach takes the appropriate actions. A rough example would be:

    dataframe
      .withColumn("rule1", when(...).otherwise(...))
      .withColumn("rule2", when(...).otherwise(...))
      ...
      .filter(...) // filter out rows where no rule matched
      .as[RuleMatches]
      .foreach { matches =>
        // take action for each rule matched
      }

This would evaluate the rules with Catalyst optimization, and apply the non-optimized foreach function ONLY on rows that matched some rule (which is hopefully << total rows).

On Tue, Aug 8, 2017 at 11:12 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> This is not easy to say without testing. It depends on the type of computation
> etc. It also depends on the Spark version. Generally vectorization / SIMD
> could be much faster if it is applied by Spark / the JVM in scenario 2.
>
> > On 9. Aug 2017, at 07:05, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote:
> >
> > I am using structured streaming to evaluate multiple rules on the same running stream.
> > I have two options to do that. One is to use forEach and evaluate all the rules on the row.
> > The other option is to express the rules in the Spark SQL DSL and run multiple queries.
> > I was wondering if option 1 will result in better performance even though I can get Catalyst optimization in option 2.
> >
> > Thanks
> > Raghav
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
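To make the hybrid idea above concrete without needing a Spark cluster, here is a minimal plain-Python sketch of the same pattern: one boolean flag per rule is attached to each row, rows that match no rule are filtered out, and the (unoptimized) per-row action runs only on the survivors. The rule predicates, field names, and row structure are all hypothetical stand-ins for the SQL `when(...).otherwise(...)` expressions in the Scala sketch.

```python
# Hypothetical rules, stand-ins for when(...).otherwise(...) SQL expressions.
RULES = {
    "rule1": lambda row: row["amount"] > 100,    # e.g. large transaction
    "rule2": lambda row: row["user"] == "admin", # e.g. privileged user
}

def evaluate(rows):
    """Attach one boolean column per rule, then keep only rows with a match."""
    flagged = [
        {**row, **{name: pred(row) for name, pred in RULES.items()}}
        for row in rows
    ]
    # Mirrors the .filter(...) step: drop rows where no rule matched.
    return [r for r in flagged if any(r[name] for name in RULES)]

rows = [
    {"user": "alice", "amount": 150},
    {"user": "bob",   "amount": 10},
    {"user": "admin", "amount": 5},
]

matched = evaluate(rows)
# The per-rule action (the foreach body) now touches only matched rows.
for r in matched:
    print(r["user"], [name for name in RULES if r[name]])
```

The point of the pattern is visible here: "bob" matches no rule and never reaches the action loop, so the expensive per-row logic only runs on the (hopefully small) matched subset.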