Hi,

A couple of clarifications:


   1. How is the log data stored, say on HDFS?
   2. You stated that you want to show the first 100 entries for a given
   condition. Is that condition itself a predicate?

There are articles on predicate pushdown in Spark. For example, check

Using Spark predicate push down in Spark SQL queries | DSE 6.0 Dev guide
(datastax.com)
<https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html#:~:text=A%20predicate%20push%20down%20filters,WHERE%20clauses%20to%20the%20database.>
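
To see whether a filter is actually pushed down, you can inspect the physical
plan. A minimal sketch in Scala, assuming a hypothetical Parquet log dataset
with ts/status/host/message columns (adjust the path and columns to your layout):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ppd-check").getOrCreate()

// hypothetical location of the log data
val logs = spark.read.parquet("hdfs:///logs/parquet")

logs.filter("status = 500 AND ts >= '2021-04-01'")
  .select("ts", "host", "message")
  .explain()
// In the scan node of the plan, look for PushedFilters: [...] to confirm
// the predicates reached the Parquet reader.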

Although large is a relative term, a couple of million rows is not that
large. You can also try most of the following settings in spark-sql

spark-sql> set spark.sql.adaptive.enabled = true;
spark.sql.adaptive.enabled      true
Time taken: 0.011 seconds, Fetched 1 row(s)
spark-sql> set hive.optimize.ppd = true;
hive.optimize.ppd       true
Time taken: 0.011 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.cbo.enabled = true;
spark.sql.cbo.enabled   true
Time taken: 0.01 seconds, Fetched 1 row(s)
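
The same can be set programmatically on the SparkSession if you are not in
the spark-sql shell. A sketch assuming Spark 3.x configuration names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("conf-demo").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")        // adaptive query execution
spark.conf.set("spark.sql.cbo.enabled", "true")             // cost-based optimizer
spark.conf.set("spark.sql.parquet.filterPushdown", "true")  // push filters into the Parquet reader (on by default)
spark.conf.set("spark.sql.orc.filterPushdown", "true")      // same for ORC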

Spark SQL is influenced by HiveQL, so you can also leverage the predicate
pushdown settings from Hive SQL.

Check this link as well:

Spark SQL Performance Tuning by Configurations — SparkByExamples
<https://sparkbyexamples.com/spark/spark-sql-performance-tuning-configurations/>
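
On your specific point about ordering: as you say, Spark itself does not push
ORDER BY/LIMIT down to the source, but for a JDBC source you can push it down
yourself by handing the database a subquery through the dbtable option. A
sketch with hypothetical connection details and table/column names:

val top100 = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/logs")   // hypothetical
  .option("dbtable",
    "(SELECT * FROM events WHERE status = 500 ORDER BY ts LIMIT 100) t")
  .option("user", "loguser")
  .option("password", "***")
  .load()

top100.show()
// The ORDER BY and LIMIT run inside the database, so only 100 rows come
// back to the executors.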

HTH





   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 4 Apr 2021 at 23:55, Kohki Nishio <tarop...@gmail.com> wrote:

> Hello,
>
> I'm trying to use Spark SQL as a log analytics solution. As you might
> guess, for most use-cases, data is ordered by timestamp and the amount of
> data is large.
>
> If I want to show the first 100 entries (ordered by timestamp) for a given
> condition, Spark Executor has to scan the whole entries to select the
> top 100 by timestamp.
>
> I understand this behavior, however, some of the data sources such as JDBC
> or Lucene can support ordering and in this case, the target data is large
> (a couple of millions). I believe it is possible to pushdown orderings to
> the data sources and make the executors return early.
>
> Here's my ask, I know Spark doesn't do such a thing... but I'm looking for
> any pointers, references which might be relevant to this, or .. any random
> idea would be appreciated. So far I found, some folks are working on
> aggregation pushdown (SPARK-22390), but I don't see any current activity
> for ordering pushdown.
>
> Thanks
>
>
> --
> Kohki Nishio
>
