Thanks as always, Paul.

Well, technically Drill performs *better* than the (official) Presto, because 
the prestosql/presto image's mongodb connector is broken at the current time...

I've been using the starburstdata/presto image because it's apparently more 
stable.  I'm still a little confused how they achieve that because I *thought* 
they were building from the same (prestosql) code base, with just some  
configuration changes, but I think they must've modified code as well.

Anyway, as for what kind of query, it's just a fairly vanilla query with a few 
range tests.

The (cleaned up and sanitized) Mongo query looks something like:

{ "$and" :
  [ 
    { "id" : { "$lt"  : 1000 } },
    { "d"  : { "$gte" : { "$date" : 1578000000000 } } },
    { "d"  : { "$lte" : { "$date" : 1578200000000 } } },
    { "x"  : { "$gte" : 1.0 } },
    { "x"  : { "$lte" : 4.0 } },
    { "y"  : { "$gte" : 5.0 } },
    { "y"  : { "$lte" : 8.0 } }
  ]
}

and then the SQL becomes something simple like:

SELECT * FROM mongodb.<schema>."<table.name>"
WHERE ( id < 1000 )
  AND ( d BETWEEN '2020-01-01T12:00:00Z' AND '2020-01-02T12:00:00Z' )
  AND ( x BETWEEN 1.0 AND 4.0 )
  AND ( y BETWEEN 5.0 AND 8.0 )
OFFSET 0 LIMIT 5000

> On March 5, 2020 at 2:31 AM Paul Rogers <[email protected]> wrote:
> 
> 
> Hi Ron,
> 
> Sounds like the good news is that Drill is about as good as Presto when 
> querying Mongo. Sounds like the bad news is that both are equally deficient. 
> On the other hand, the other good news is that better performance is just a 
> matter of adding additional planning rules (with perhaps some Mongo metadata.)
> 
> 
> The Wikipedia page for Mongo [1] suggests several features that Mongo (Simba) 
> is probably using in their own JDBC driver, but which Drill probably does not 
> use:
> 
> * Primary and secondary indices
> * Field, range query, and regular-expression searches
> * User-defined JavaScript functions
> * Three ways to perform aggregation: the aggregation pipeline, the map-reduce 
> function, and single-purpose aggregation methods.
> 
> My guess is that the Mongo JDBC driver does thorough planning to exploit each 
> of the above functions, while Drill may use only a few. We already noted 
> other weaknesses in the filter push-down code for the Drill Mongo plugin. 
> Seems fixable if we can put in the effort.
> 
> 
> Seems Mongo provides a Simba JDBC driver, which is proprietary, so no source 
> code is available we could use as a "cheat sheet" to see what's what.
> 
> 
> Just out of curiosity, what is the query that works well with the Mongo JDBC 
> driver, but poorly with Drill?
> 
> Anybody know more about how Mongo works and what Drill might be missing?
> 
> 
> Thanks,
> - Paul
> 
> [1] https://en.wikipedia.org/wiki/MongoDB
> 
> 
> 
>  
> 
>     On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini 
> <[email protected]> wrote:  
>  
>  Hi, guys.
> 
> This is actually more of a Mongo question than a Drill-specific question as 
> it also applies to Presto + Mongo, and the vanilla Mongo shell as well.
> 
> I'm asking here, though, because, well, I'm curious, and because you're the 
> database geniuses...
> 
> So, I essentially get why a NoSQL database, in general, wouldn't be as 
> performant as a SQL one at "relational" things.  From what I gather, there 
> are denormalization and optimization techniques and tricks you can use to 
> speed up a Mongo query and so forth, but my question is:
> 
> Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large 
> collection (100-200 million documents) that includes even a single WHERE 
> clause, or the Mongo equivalent query made via Mongo shell, basically never 
> returns and has to be killed, whereas the same (Mongo equivalent) query 
> against the same collection made via *Mongo's* JDBC driver takes only a 
> second or two?
> 
> Is the Mongo JDBC using some indexing that the others aren't?  (But how would 
> that explain Mongo shell's non-performance...  Why doesn't Mongo shell just 
> make a JDBC call to the db...)
> 
> Thank you in advance for educating me.
> 
> Ron
>

Reply via email to