Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :)

-0xe1a

On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com  wrote:

> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620 ready to go that
> will extend the definition of whitespace (what separates tokens) from the
> small set of ASCII characters (space, tab, linefeed) to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks
>


Query hints visible to DSV2 connectors?

2023-08-02 Thread Alex Cruise
Hey folks,

I'm adding an optional feature to my DSV2 connector where it can choose
between a row-based or columnar PartitionReader dynamically depending on a
query's schema. I'd like to be able to supply a hint at query time that's
visible to the connector, but at the moment I can't see any way to
accomplish that.

From what I can see, the artifacts produced by the existing hint system [
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html,
or sql("select 1").hint("foo").show()] aren't visible from the
TableCatalog/Table/ScanBuilder.

I guess I could set a config parameter but I'd rather do this on a
per-query basis. Any tips?
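
The best I've come up with so far is to abuse a per-query reader option as the
hint, roughly like this (a minimal sketch only; MyTable, MyScanBuilder and the
"reader.columnar" option are hypothetical, and it assumes the read goes through
spark.read.format(...).option(...) rather than a catalog-managed SQL table, so
the option lands in the CaseInsensitiveStringMap handed to newScanBuilder):

    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class MyTable(tableSchema: StructType) extends Table with SupportsRead {
      override def name(): String = "my_table"
      override def schema(): StructType = tableSchema
      override def capabilities(): java.util.Set[TableCapability] =
        java.util.EnumSet.of(TableCapability.BATCH_READ)

      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
        // Hypothetical per-query "hint", set with .option("reader.columnar", "true")
        val columnar = options.getBoolean("reader.columnar", false)
        new MyScanBuilder(tableSchema, columnar)
      }
    }

    class MyScanBuilder(schema: StructType, columnar: Boolean) extends ScanBuilder {
      // Choose a row-based or columnar PartitionReaderFactory here based on `columnar`
      override def build(): Scan = ???
    }

But that only helps for reads that go through DataFrameReader options, so I'd
still love a real per-query hint mechanism.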

Thanks!

-0xe1a


Re: Late materialization?

2023-05-31 Thread Alex Cruise
Just to clarify briefly, in hopes that future searchers will find this
thread... ;)

IIUC at the moment, partition pruning and column pruning are
all-or-nothing: every partition and every column either is, or is not, used
for a query.

Late materialization would mean that only the values needed for filtering &
aggregation would be read in the scan+filter stages, and any expressions
requested by the user but not needed for filtering and aggregation would
only be read/computed afterward.
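
To make that concrete (illustrative only; the table, columns and UDF below are
hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val expensiveUdf = udf((s: String) => s.length) // stand-in for a costly expression

    val q = spark.table("events")
      .filter(col("ts") >= "2023-01-01" && col("device_id") === 42)
      .select(col("big_blob"), expensiveUdf(col("payload")), col("ts"))

    // Today the scan reads big_blob and payload for every row it scans; with late
    // materialization, the scan+filter stages would read only ts and device_id, and
    // big_blob/payload would be fetched (and the UDF evaluated) only for matching rows.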

I can see how this will invite sequential consistency problems, in data
sources where mutations like DML or compactions are happening behind the
query's back, but presumably Spark users already have this class of
problem, it's just less serious when the end-to-end execution time of a
query is shorter.

WDYT?

-0xe1a

On Wed, May 31, 2023 at 11:03 AM Alex Cruise  wrote:

> Hey folks, I'm building a Spark connector for my company's proprietary
> data lake... That project is going fine despite the near total lack of
> documentation. ;)
>
> In parallel, I'm also trying to figure out a better story for when humans
> inevitably `select * from 100_trillion_rows`, glance at the first page,
> then walk away forever. The traditional RDBMS approach seems to be to keep
> a lot of state in server-side cursors, so they can eagerly fetch only the
> first few pages of results and go to sleep until the user advances the
> cursor, at which point we wake up and fetch a few more pages.
>
> After some cursory googling about how Trino handles this nightmare
> scenario, I found https://github.com/trinodb/trino/issues/49 and its
> child https://github.com/trinodb/trino/pull/602, which appear to be based
> on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is
> what HyPerDB (never open source, acquired by Tableau) was based on.
>
> IIUC this kind of optimization isn't really feasible in Spark at present,
> due to the sharp distinction between transforms, which are always lazy, and
> actions, which are always eager. However, given the very desirable
> performance/efficiency benefits, I think it's worth starting this
> conversation: if we wanted to do something like this, where would we start?
>
> Thanks!
>
> -0xe1a
>


Late materialization?

2023-05-31 Thread Alex Cruise
Hey folks, I'm building a Spark connector for my company's proprietary data
lake... That project is going fine despite the near total lack of
documentation. ;)

In parallel, I'm also trying to figure out a better story for when humans
inevitably `select * from 100_trillion_rows`, glance at the first page,
then walk away forever. The traditional RDBMS approach seems to be to keep
a lot of state in server-side cursors, so they can eagerly fetch only the
first few pages of results and go to sleep until the user advances the
cursor, at which point we wake up and fetch a few more pages.

After some cursory googling about how Trino handles this nightmare
scenario, I found https://github.com/trinodb/trino/issues/49 and its child
https://github.com/trinodb/trino/pull/602, which appear to be based on the
paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is what
HyPerDB (never open source, acquired by Tableau) was based on.

IIUC this kind of optimization isn't really feasible in Spark at present,
due to the sharp distinction between transforms, which are always lazy, and
actions, which are always eager. However, given the very desirable
performance/efficiency benefits, I think it's worth starting this
conversation: if we wanted to do something like this, where would we start?

Thanks!

-0xe1a


planInputPartitions being called twice

2023-05-12 Thread Alex Cruise
(I posted this on Slack originally)

Hey folks, I’m writing a batch connector for an in-house data lake and
doing some performance work now… I’ve noticed my ScanBuilder creates a Scan
exactly once, but its toBatch method is being called three times (returning
the identical object every time), and the batch’s planInputPartitions
method is being called twice, doing a large amount of redundant work. I'm
targeting Spark 3.3.2 currently because EMR doesn't support Spark 3.4.x yet.

This is all on a single node, in local mode. planInputPartitions() is itself a
somewhat expensive operation, so I’d rather not have it called twice.

I haven’t implemented SupportsRuntimeFiltering yet, but I’m not confident
it would help with this specific problem.

The javadoc for planInputPartitions says it’ll "be called only once, to
launch one Spark job"; OTOH,
https://github.com/vertica/spark-connector/issues/171#issuecomment-1051162865
says it’s normal for it to be called twice.

Well, at least it’s called on the same instance both times, so I can just
cache the results I guess… annoying though.
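
Concretely, the caching I have in mind is just something like this (minimal
sketch; MyBatch and planPartitionsExpensively are hypothetical stand-ins for
my connector's code):

    import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}

    class MyBatch extends Batch {
      // Computed once and reused across repeated planInputPartitions() calls,
      // which works because Spark is calling them on the same Batch instance.
      private lazy val cachedPartitions: Array[InputPartition] = planPartitionsExpensively()

      override def planInputPartitions(): Array[InputPartition] = cachedPartitions
      override def createReaderFactory(): PartitionReaderFactory = ???

      private def planPartitionsExpensively(): Array[InputPartition] = ???
    }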

Is there a well-known better way to avoid this inefficiency? Is it a bug?

Thanks!

-0xe1a


Recent paper that might be relevant to pushdown and other optimizations

2023-04-21 Thread Alex Cruise
Optimizing Query Predicates with Disjunctions for Column Stores
https://arxiv.org/pdf/2002.00540.pdf  [abstract at the end of my message]

I just googled [predicate pushdown cnf] and it's WILD to me that this paper
came up on the first page of search results, and was published last year.
It mentions Spark briefly as an example of a system that "does not seem to
implement any additional optimizations [for disjunctions]".

Before posting this I googled [spark "2002.00540"] and it didn't appear that
anyone in the Spark community was talking about it; forgive me if I've
missed something, as I've only recently joined the list!

-0xe1a

*Abstract*
Since its inception, database research has given limited attention to
optimizing predicates with disjunctions. For conjunctions, there exists a
“rule-of-thumb” of evaluating predicates in increasing selectivity to
minimize unnecessary predicate evaluations. However, for disjunctions, no
such rule-of-thumb exists. Furthermore, what little past work there is, has
mostly focused on optimizations for traditional row-oriented databases. A
key difference in predicate evaluation for row stores and column stores is
that while row stores apply predicates to a single record at a time, column
stores apply predicates to sets of records. Not only must the execution
engine decide the order in which to apply the predicates, but it must also
decide how to combine these sets to minimize the total number of records
these predicates are applied to. Our goal for this work is to provide a
good “rule-of-thumb” algorithm for queries with both conjunctions and
disjunctions in a column store setting. We present EvalPred, the first
polynomial-time (i.e., O(n log n)) predicate evaluation algorithm with
provably optimal guarantees for all predicate expressions of nested depth 2
or less. EvalPred’s optimality is guaranteed under a wide range of cost
models, representing different real-world scenarios, as long as the cost
model follows a “triangle-inequality”-like property. Yet, despite its
powerful guarantees, EvalPred is almost trivially simple and should be easy
to implement, even in existing systems. Even for predicate expressions of
depth 3 or greater, we show via experimentation that EvalPred comes quite
close to optimal with 92% of queries within 5% of optimal. Furthermore,
compared to other algorithms, EvalPred achieves average speedups of 2.6×
over no disjunction optimization, 1.4× over a greedy algorithm, and 28×
over the state-of-the-art for the top 10% of queries.


Re: Adding new connectors

2023-03-27 Thread Alex Cruise
On Fri, Mar 24, 2023 at 11:23 AM Alex Cruise  wrote:

> I found ExternalCatalog a few days ago and have been implementing one of
> those, but it seems like DataSourceRegister / SupportsCatalogOptions is
> another popular approach. I'm not sure offhand how they overlap/intersect
> just yet.
>

I would love it if someone could comment on when implementing
ExternalCatalog is a good idea, vs. other approaches. :)

-0xe1a




Re: Adding new connectors

2023-03-24 Thread Alex Cruise
On Fri, Mar 24, 2023 at 3:18 PM John Zhuge  wrote:

> Is this similar to Iceberg's hidden partitioning? Check out the details in
> the spec: https://iceberg.apache.org/spec/#partition-transforms
>

Yes, it's very similar to the main stated use case for hidden partitioning,
so I guess I'm pretty likely to find some useful homework to copy from in
the Iceberg spark connector source. :)

-0xe1a



Re: Adding new connectors

2023-03-24 Thread Alex Cruise
On Fri, Mar 24, 2023 at 1:46 PM John Zhuge  wrote:

> Have you checked out SparkCatalog in the Apache Iceberg project? More docs
> at https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
>

No, I hadn't seen that one yet, thanks!

Another question: our partitions have no useful uniqueness criteria other
than a storage URL which should never be exposed to user-space. Our
"primary" index is a timestamp, and multiple partitions within a table can
have overlapping time ranges. We support an additional shard key but it's
optional. Is there something like partition discovery in DataSourceV2 where
I should list all of the (potentially many thousands of) partitions for a
table, or can I leave them unpopulated until query planning time, when time
range predicates often have extremely high selectivity?
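
What I'm imagining is deferring the listing until filters have been pushed
down, something like this rough sketch (hypothetical names; assumes
SupportsPushDownFilters and a "ts" column):

    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
    import org.apache.spark.sql.sources.{Filter, GreaterThanOrEqual, LessThan}
    import org.apache.spark.sql.types.StructType

    class MyScanBuilder(schema: StructType) extends ScanBuilder with SupportsPushDownFilters {
      private var pushed: Array[Filter] = Array.empty

      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        // Keep the time-range filters we can prune partitions with...
        pushed = filters.collect {
          case f @ GreaterThanOrEqual("ts", _) => f
          case f @ LessThan("ts", _)           => f
        }
        // ...and return everything so Spark still re-applies the filters after the scan.
        filters
      }

      override def pushedFilters(): Array[Filter] = pushed

      // Only here, not at catalog load time, would we enumerate the partitions whose
      // time ranges overlap the pushed predicates.
      override def build(): Scan = ???
    }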

Thanks!

-0xe1a



Adding new connectors

2023-03-24 Thread Alex Cruise
Hey folks, please let me know if this is more of a user@ post!

I'm building a Spark connector for my company's data-lake-ish product, and
it looks like there's very little documentation about how to go about it.

I found ExternalCatalog a few days ago and have been implementing one of
those, but it seems like DataSourceRegister / SupportsCatalogOptions is
another popular approach. I'm not sure offhand how they overlap/intersect
just yet.
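
For reference, the DataSourceRegister / TableProvider route I'm weighing it
against looks roughly like this (a sketch with hypothetical names, not my
actual connector):

    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.sources.DataSourceRegister
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Also needs a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    // file naming this class so Spark can discover it.
    class MyDataSource extends TableProvider with DataSourceRegister {
      override def shortName(): String = "mylake" // enables spark.read.format("mylake")

      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: java.util.Map[String, String]): Table = ???
    }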

I've also noticed a few implementations that put some of their code in
org.apache.spark.* packages in addition to their own; presumably this isn't
by accident. Is this practice necessary to get around package-private
visibility or something?

Thanks!

-0xe1a