It's important to note that, as of today, running multiple streaming
queries would read the input data that many times, so there is a trade-off
between the two approaches. Even though scenario 1 won't get Catalyst
optimization, it may be more efficient overall in terms of resource usage.
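
For concreteness, a minimal sketch of scenario 2, assuming the built-in
"rate" source and made-up rules (the predicates and console sink are only
placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-query").getOrCreate()
import spark.implicits._

// Built-in test source: emits rows with columns (timestamp, value: Long).
val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// One streaming query per rule. Each started query tracks its own offsets
// and re-reads the source independently, so the input is consumed N times.
val q1 = stream.filter($"value" % 2 === 0)   // hypothetical rule 1
  .writeStream.format("console").queryName("rule1").start()
val q2 = stream.filter($"value" % 5 === 0)   // hypothetical rule 2
  .writeStream.format("console").queryName("rule2").start()

spark.streams.awaitAnyTermination()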

A hybrid solution may be possible. You could express multiple rules in the
SQL DSL: for N rules, add N boolean columns whose values are set by each
rule expressed through SQL functions, and have a single foreach take the
appropriate actions. A rough example:

dataframe
  .withColumn("rule1", when(...).otherwise(...))
  .withColumn("rule2", when(...).otherwise(...))
  ...
  .filter(...)      // keep only rows where at least one rule matched
  .as[RuleMatches].foreach { matches =>
    // take action for each rule matched
  }

This would evaluate the rules with Catalyst optimization, and apply the
non-optimized foreach function ONLY to rows that matched some rule (which
is hopefully << total rows).
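
Filling in the sketch, here is one way it could look end to end. This is a
hedged example, not tested: the rules and actions are hypothetical, and in
structured streaming the per-row action has to go through
writeStream.foreach with a ForeachWriter rather than a plain Dataset.foreach.

import org.apache.spark.sql.{ForeachWriter, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("hybrid-rules").getOrCreate()
import spark.implicits._

case class RuleMatches(value: Long, rule1: Boolean, rule2: Boolean)

val stream = spark.readStream.format("rate").load()

// Rules live in the SQL DSL, so Catalyst optimizes their evaluation.
val tagged = stream
  .withColumn("rule1", when($"value" % 2 === 0, true).otherwise(false))
  .withColumn("rule2", when($"value" % 5 === 0, true).otherwise(false))
  .filter($"rule1" || $"rule2")            // drop rows matching no rule
  .select($"value", $"rule1", $"rule2")
  .as[RuleMatches]

// The non-optimized per-row actions run only on the surviving matches.
val query = tagged.writeStream.foreach(new ForeachWriter[RuleMatches] {
  def open(partitionId: Long, epochId: Long): Boolean = true
  def process(m: RuleMatches): Unit = {
    if (m.rule1) println(s"rule1 matched: ${m.value}")  // placeholder action
    if (m.rule2) println(s"rule2 matched: ${m.value}")  // placeholder action
  }
  def close(errorOrNull: Throwable): Unit = ()
}).start()

query.awaitTermination()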



On Tue, Aug 8, 2017 at 11:12 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> This is not easy to say without testing. It depends on the type of
> computation etc. It also depends on the Spark version. Generally
> vectorization / SIMD could be much faster if it is applied by Spark / the
> JVM in scenario 2.
>
> > On 9. Aug 2017, at 07:05, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
> >
> > I am using structured streaming to evaluate multiple rules on same
> running stream.
> > I have two options to do that. One is to use foreach and evaluate all
> the rules on the row.
> > The other option is to express rules in spark sql dsl and run multiple
> queries.
> > I was wondering if option 1 will result in better performance even
> though only option 2 gets Catalyst optimization.
> >
> > Thanks
> > Raghav
>
