Re: Ordering pushdown for Spark Datasources

2021-04-06 Thread Mich Talebzadeh
Lucene. I came across it years ago. Does Lucene support JDBC connections at all? How about Solr? HTH
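For what it's worth: Lucene is an embedded library with no server endpoint, so it has no JDBC interface of its own, but Solr does ship a thin JDBC driver over its Parallel SQL interface (SolrCloud mode). A minimal sketch, assuming a hypothetical ZooKeeper address and a hypothetical "logs" collection, with solr-solrj on the classpath:

```scala
import java.sql.DriverManager

object SolrJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Solr's JDBC URL points at the ZooKeeper ensemble of a SolrCloud
    // cluster, not at a Solr node; host and collection here are made up.
    val url = "jdbc:solr://localhost:9983?collection=logs"
    val conn = DriverManager.getConnection(url)
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT timestamp, message FROM logs " +
          "WHERE level = 'INFO' ORDER BY timestamp ASC LIMIT 100")
      while (rs.next()) println(s"${rs.getString(1)} ${rs.getString(2)}")
    } finally conn.close()
  }
}
```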

Re: Ordering pushdown for Spark Datasources

2021-04-06 Thread Kohki Nishio
The log data is stored in Lucene, and I have a custom data source to access it. For example, if the condition is log-level = INFO, that brings in a couple of million records per partition, and hundreds of partitions are involved in a query. Spark has to go through all the entries to show the first 100.
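Worth noting for later readers: Spark 3.2 added a DataSource V2 mix-in, SupportsPushDownTopN, for exactly this ORDER BY ... LIMIT case; it did not exist when this thread was written. A minimal sketch of how a Lucene-backed ScanBuilder might accept the pushdown (LuceneScanBuilder and the inner Scan are hypothetical; a real build() would translate the pushed sort and limit into a Lucene top-N query):

```scala
import org.apache.spark.sql.connector.expressions.{NamedReference, SortOrder}
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownTopN}
import org.apache.spark.sql.types.StructType

// SupportsPushDownTopN extends ScanBuilder; Spark offers the query's
// ORDER BY ... LIMIT n to the source through pushTopN.
class LuceneScanBuilder(schema: StructType) extends SupportsPushDownTopN {
  private var pushedOrders: Array[SortOrder] = Array.empty[SortOrder]
  private var pushedLimit: Int = -1

  override def pushTopN(orders: Array[SortOrder], limit: Int): Boolean = {
    // Accept only a single sort key on "timestamp": Lucene can serve a
    // top-N on an indexed field without materializing every matching doc.
    val ok = orders.length == 1 && (orders.head.expression() match {
      case ref: NamedReference => ref.fieldNames().sameElements(Array("timestamp"))
      case _                   => false
    })
    if (ok) { pushedOrders = orders; pushedLimit = limit }
    ok // true = the source will apply (at least partially) the sort and limit
  }

  override def build(): Scan = new Scan {
    // A real implementation would turn pushedOrders/pushedLimit into a
    // Lucene top-N collector here; readSchema is all Scan strictly requires.
    override def readSchema(): StructType = schema
  }
}
```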

Re: Ordering pushdown for Spark Datasources

2021-04-05 Thread Mich Talebzadeh
Hi, A couple of clarifications: 1. How is the log data stored, say on HDFS? 2. You stated you want to show the first 100 entries for a given condition. Is that condition itself a predicate? There are articles on predicate pushdown in Spark. For example, check Using Spark predicate push down in
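To illustrate the predicate pushdown being referenced, a self-contained sketch against Parquet (the tiny dataset and temp path are made up so it can run as-is); the scan node of the physical plan reports the pushed predicate under PushedFilters:

```scala
import org.apache.spark.sql.SparkSession

object PushedFiltersDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown").getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet file so the example is self-contained.
    val path = java.nio.file.Files.createTempDirectory("logs").toString + "/logs"
    Seq(("INFO", "started"), ("WARN", "slow")).toDF("level", "message")
      .write.parquet(path)

    // The equality predicate is one Spark can hand to the Parquet reader.
    val q = spark.read.parquet(path).filter($"level" === "INFO")

    // The scan node of the physical plan reports it, e.g.
    //   PushedFilters: [IsNotNull(level), EqualTo(level,INFO)]
    q.explain()
  }
}
```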

Ordering pushdown for Spark Datasources

2021-04-04 Thread Kohki Nishio
Hello, I'm trying to use Spark SQL as a log-analytics solution. As you might guess, for most use cases the data is ordered by timestamp and the amount of data is large. If I want to show the first 100 entries (ordered by timestamp) for a given condition, the Spark executor has to scan all the entries.
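For reference, a runnable sketch of the query shape being described, using a tiny in-memory stand-in for the log data (the real source here would be a custom Lucene reader); the comment after explain() shows roughly the operator to look for:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object TopNPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("topN-plan").getOrCreate()
    import spark.implicits._

    // Stand-in rows; in the real setup this comes from the log store.
    val logs = Seq(
      (Timestamp.valueOf("2021-04-04 00:00:00"), "INFO", "started"),
      (Timestamp.valueOf("2021-04-04 00:00:01"), "WARN", "slow")
    ).toDF("timestamp", "level", "message")

    val top100 = logs
      .where($"level" === "INFO")
      .orderBy($"timestamp".asc)
      .limit(100)

    top100.explain()
    // The sort+limit appears as TakeOrderedAndProject(limit=100, orderBy=[timestamp ASC ...])
    // sitting above the scan: every row matching the filter is read and sorted
    // in Spark before the limit applies, which is the full-scan problem described.
  }
}
```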