Thanks as always, Paul.
Well, technically Drill performs *better* than the (official) Presto, because
the prestosql/presto image's mongodb connector is broken at the current time...
I've been using the starburstdata/presto image because it's apparently more
stable. I'm still a little confused how they achieve that because I *thought*
they were building from the same (prestosql) code base, with just some
configuration changes, but I think they must've modified code as well.
Anyway, as for what kind of query, it's just a fairly vanilla query with a few
range tests.
The (cleaned up and sanitized) Mongo query looks something like:
{ "$and" :
[
{ "id" : { "$lt" : 1000 } },
{ "d" : { "$gte" : { "$date" : 1578000000000 } } },
{ "d" : { "$lte" : { "$date" : 1578200000000 } } },
{ "x" : { "$gte" : 1.0 } },
{ "x" : { "$lte" : 4.0 } },
{ "y" : { "$gte" : 5.0 } },
{ "y" : { "$lte" : 8.0 } }
]
}
and then the SQL becomes something simple like:
SELECT * FROM mongodb.<schema>."<table.name>"
WHERE ( id < 1000 )
AND ( d BETWEEN '2020-01-01T12:00:00Z' AND '2020-01-02T12:00:00Z' )
AND ( x BETWEEN 1.0 AND 4.0 )
AND ( y BETWEEN 5.0 AND 8.0 )
OFFSET 0 LIMIT 5000
> On March 5, 2020 at 2:31 AM Paul Rogers <[email protected]> wrote:
>
>
> Hi Ron,
>
> Sounds like the good news is that Drill is about as good as Presto when
> querying Mongo. Sounds like the bad news is that both are equally deficient.
> On the other hand, the other good news is that better performance is just a
> matter of adding additional planning rules (with perhaps some Mongo metadata.)
>
>
> The Wikipedia page for Mongo [1] suggests several features that Mongo (Simba)
> is probably using in their own JDBC driver, but which Drill probably does not
> use:
>
> * Primary and secondary indices
> * Field, range query, and regular-expression searches
> * User-defined JavaScript functions
> * Three ways to perform aggregation: the aggregation pipeline, the map-reduce
> function, and single-purpose aggregation methods.
>
> My guess is that the Mongo JDBC driver does thorough planning to exploit each
> of the above functions, while Drill may use only a few. We already noted
> other weaknesses in the filter push-down code for the Drill Mongo plugin.
> Seems fixable if we can put in the effort.
>
>
> Seems Mongo provides a Simba JDBC driver, which is proprietary, so no source
> code is available we could use as a "cheat sheet" to see what's what.
>
>
> Just out of curiosity, what is the query that works well with the Mongo JDBC
> driver, but poorly with Drill?
>
> Anybody know more about how Mongo works and what Drill might be missing?
>
>
> Thanks,
> - Paul
>
> [1] https://en.wikipedia.org/wiki/MongoDB
>
>
>
>
>
> On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini
> <[email protected]> wrote:
>
> Hi, guys.
>
> This is actually more of a Mongo question than a Drill-specific question as
> it also applies to Presto + Mongo, and the vanilla Mongo shell as well.
>
> I'm asking here, though, because, well, I'm curious, and because you're the
> database geniuses...
>
> So, I essentially get why a NoSQL database, in general, wouldn't be as
> performant as a SQL one at "relational" things. From what I gather, there
> are denormalization and optimization techniques and tricks you can use to
> speed up a Mongo query and so forth, but my question is:
>
> Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large
> collection (100-200 million documents) that includes even a single WHERE
> clause, or the Mongo equivalent query made via Mongo shell, basically never
> returns and has to be killed, whereas the same (Mongo equivalent) query
> against the same collection made via *Mongo's* JDBC driver takes only a
> second or two?
>
> Is the Mongo JDBC using some indexing that the others aren't? (But how would
> that explain Mongo shell's non-performance... Why doesn't Mongo shell just
> make a JDBC call to the db...)
>
> Thank you in advance for educating me.
>
> Ron
>